This report investigates the relative fossil-fuel consumption intensity (as a percentage of total energy consumption) for the United States at the state level in 2015. Understanding the relative contributions (or lack thereof) of economic, natural resource, political, and human capital factors to the structure of a well-fit model was of especial interest.
Through exploratory data analysis followed by model-building (including several diagnostics and model iterations), a final model was reached which adequately satisfies necessary conditions for valid multiple linear regression.
For the past several decades, the scientific consensus has become increasingly supportive of the theory of anthropogenic climate change, driven largely by fossil fuel consumption stretching back to the coal burnt in Industrial Revolution-era England and continuing to the present day. In light of research in this area and concerns of potential consequences for continued carbon-intensive processes, interest has emerged in decoupling economic activity from activities which release greenhouse gasses into the atmosphere.
The ultimate goal of this analysis was to develop a better understanding of which factors have the strongest association with the ‘carbon footprint’ of a state’s energy consumption. For example, based on regression modeling, is it reasonable to argue that there is a complex (potential indiscernible) mix of economic and political and human capital and natural resource factors associated with the fossil fuel intensity of a state’s energy consumption, or does one or multiple component(s) of these factors seem to best relate to fossil fuel consumption?
Given the constraints of time and available knowledge, the analysis relied on evaluating a variety of factors around energy consumption for the 50 states in the most recent year for which (hopefully) sufficient data was available - in this case, 2015. The outcome focus of the analysis is developing a multiple regression model which both satisfies all standard conditions (independence of observations, normality of errors, etc.) and possesses strong ‘explanatory’ utility, in the sense of providing a good fit of the outcome variable of interest (per-state fossil fuel consumption as a percentage of total energy consumed, taken as a snapshot from 2015). This fit will be assessed using the metrics R2adj and AIC.
The below subsections explore the various variables explored as potentially having association with the carbon intensity of each state’s energy consumption in 2015. Data was gathered from a variety of sources, with most being government agencies charged with collecting and reporting information relating to domestic energy, economic, and/or demographic developments.
Regarding potential biases in the data, there may be some concern of data collection methods resulting in bias away from certain subgroups of the population(s) under consideration. For example, demographic data collected by the Census Bureau may systemically under-count lower-income, lesser-educated populations which may not be as readily accessible as others for the purposes of surveying.
EIA consumption and production data appear to be largely sourced from industrial partners (e.g. geothermal consumption is reported to the EIA using form EIA-923, “Power Plant Operations Report”), which is subject to the accuracy of such reporting. Additionally, consumption and production data were evaluated where each energy source was converted from its ‘native’ unit of reference (e.g. kilowatt-hours for nuclear energy consumption, or short tons for coal production) into British Thermal Units for the sake of general comparability. This required that the EIA apply conversion factors to the ‘native’ reference unit to approximate a BTU equivalent (e.g. million BTU per short ton for coal). The accuracy of such conversions is subject to the precision with which the conversion factors have been estimated - it stands to reason that older studies using less-efficient mechanisms (e.g. an outdated power plant) would downweight the energy potential in a given unit of the energy source under investigation.
That said, the EIA methodology documentation seems to indicate the collection methodology is fairly robust. I am operating under the belief that, because the observations apply to entire states and the data examined is provided by entities largely dedicated to reporting on the domains of interest, the analysis is generally robust against reasonably small amounts of bias.
As for potential confounding variables, one definite confounder that influenced the direction of the analysis was the result of investigating EIA data methodology. While confirming state-level consumption and production figures, I learned that renewable energy (geothermal, hydroelectric, solar, and wind) and nuclear energy production was assumed to be exactly equal to that of consumption for each state. This seems problematic, as it would preclude the possibility of accurately reflecting a power plant or energy generation site in one state producing energy (say near the border with another state) and transmitting it to customers in another state. As such, when evaluating individual variables, I considered only consumption for these energy sources (to prevent double-counting) and focused on energy sources with distinct production and consumption data (fossil fuels) when evaluating outcome variables of interest.
One potential confounding variable not considered in the analysis is the ‘sunk cost’ element of each state’s energy infrastructure, and its interplay with the relative economical attractiveness of various energy sources. For example, if a state has many coal plants at the end of their useful lives as wind turbines are rapidly descending the cost curve, and the state it well-situated (financially and geographically) to do so, it seems reasonable that energy companies in the state will be more amenable to installing new wind capacity, which will be reflected in subsequent years’ data. Another state that is the same in all other aspects save that it has much newer coal plants may be more inclined to either remain with coal consumption or to convert the coal plants to natural gas. Not only is the age/condition of the infrastructure important in this equation, but so too is the relative costs of different energy sources in terms of installation or conversion. This would be fundamentally critical to developing a strong model over a time series, and is likely still an important consideration for a single-year ‘snapshot’ analysis as with this analysis - however I am not sure of an appropriate approach to accounting for such a consideration.
This data comes from the Energy Information Administration State Energy Data Systems (SEDS) site at the following link: EIA SEDS site
Energy sources were classified in the following manner for the purposes of the analysis:
* Biomass (e.g. wood pellets, corn ethanol) is considered renewable energy by the EIA, but due to the contested carbon neutrality of biomass, it was removed from the renewables category for this analysis. The short summary is that the combustion output of biomass such as wood pellets can have greater particulate matter (see abstract) and similar carbon dioxide levels as fossil fuel sources.
Factoring in energy source lifecycle details (land use, fossil fuel-based fertilizer, etc.) is also uncertain as to the superiority or equivalency/inferiority of biomass versus fossil fuels. For related commentary, the right side of page 5 in the linked PDF report may be of interest.
Note: Based on post-presentation feedback, the below code (and as a result the dataset) was updated to bring in TPOPP, the variable for state population in thousands. Per the EIA technical notes documentation linked after this note, this is provided by the Census Bureau.SEDS 2015 Consumption Technical Notes
# Loading the data file
setwd("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/")
energy2015 <- read.csv("Complete_SEDS_update_2015.csv")
# Retain only variables of interest:
# CONSUMPTION
# Note: Geotherm/Hydro/Solar/Wind/Nuclear appear to be treated as both
# **consumption** and production in EIA tables C1 and P2 for 2015 by state tables
# Therefore for Renewables and Nuclear, only evaluating from Consumption vars
# except for calculating Total Consumption/Production
# Renewable Energy (RE) consumption=production: Sum of the following:
# Geothermal consumption: GETCB
# Hydropower consumption: HYTCB
# Solar consumption: SOTCB
# Wind consumption: WYTCB
# Biomass total consumption: BMTCB
# Sub-components: EMLCB, EMTCB, WWTCB
# Nuclear consumption: NUETB
# Coal consumption: CLTCB
# Natural Gas consumption: NGTCB minus the following:
# Supplemental gaseous fuels: SFTCB
# Petroleum consumption: PMTCB
# PRODUCTION
# Biomass production:
# Biomass inputs for the production of fuel ethanol: EMFDB plus the following:
# Wood and waste total consumed : WWTCB
# Geothermal production=consumption: GETCB
# Hydropower production=consumption: HYTCB
# Solar production=consumption: SOTCB
# Wind production=consumption: WYTCB
# Nuclear production=consumption: NUETB
# Coal production: CLPRB
# NatGas production: NGMPB
# Crude Oil production: PAPRB
# State population in thousands: TPOPP
keeplistEnergy2015 <- c("BMTCB", "CLTCB", "NGTCB", "PMTCB", "EMFDB",
"GETCB", "HYTCB", "SOTCB", "WYTCB", "NUETB", "CLPRB",
"NGMPB", "PAPRB","ELISB", "SFTCB", "WWTCB", "TPOPP")
# Converting metrics from Billion BTUs to Trillion BTUs, filtering out the US (total US) entries
library(dplyr)
library(tidyr)
energy2015 <- energy2015 %>% filter(MSN %in% keeplistEnergy2015) %>%
mutate(state = StateCode) %>% select(state, MSN, Data) %>%
spread(key = MSN, value = Data) %>%
mutate(gthrmECons = GETCB/1000,
hydroECons = HYTCB/1000,
solarECons = SOTCB/1000,
windECons = WYTCB/1000,
rETotCons = gthrmECons + hydroECons + solarECons + windECons,
biomassECons = BMTCB/1000,
nclrECons = NUETB/1000,
coalECons = CLTCB/1000,
natGasECons = (NGTCB - SFTCB)/1000,
petroECons = PMTCB/1000,
ffETotCons = coalECons + natGasECons + petroECons,
biomassEProd = (EMFDB + WWTCB)/1000,
coalEProd = CLPRB/1000,
natGasEProd = NGMPB/1000,
petroEProd = PAPRB/1000,
ffETotProd = coalEProd + natGasEProd + petroEProd,
totECons = rETotCons + biomassECons + nclrECons + ffETotCons,
totEProd = rETotCons + biomassEProd + nclrECons + ffETotProd,
eConsIndex = (rETotCons - ffETotCons)*100/totECons,
eConsPctRE = rETotCons*100/totECons,
eConsPctFF = ffETotCons*100/totECons,
netEBalance = totEProd - totECons,
statePopK = TPOPP) %>%
select(state, gthrmECons, hydroECons, solarECons,windECons, rETotCons,
biomassECons, nclrECons, coalECons, natGasECons, petroECons, ffETotCons,
biomassEProd, coalEProd, natGasEProd, petroEProd, ffETotProd,
totECons, totEProd, eConsIndex, eConsPctRE, eConsPctFF, netEBalance, statePopK) %>%
filter(state != "US")
# Save data file
write.csv(energy2015, file = "xenergy2015.csv", row.names = FALSE)% State GDP by industry sector data comes from a generated table from the BEA at the following link: BEA Regional Data Table Generator site
Note: Since I’m converting dollars to (per)cents from this data, I used nominal dollars instead of the inflation-adjusted ‘chained 2009 dollars’ version available - the per capita GDP by state controls for inflation.
State GDP per Capita data comes from a generated table from the BEA at the following link: BEA 2015 Per Capita Data Table Generator site 2
The code below process both data sources for analysis:
# State GDP by Industry
# Loading the data file - first two rows are header details,
# and inputting the variable names to conform to R standards
# (no vars starting with numbers, no colons in the var name)
setwd("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/")
library(readxl)
stateGDPInd2015 <- read_excel("BEA_2015_byInd_StateGDP.xls", skip = 3,
col_names = c("Fips", "Area", "IndCode",
"Industry", "Q1_2015", "Q2_2015",
"Q3_2015", "Q4_2015"),
col_types = c("text", "text", "text", "text",
"numeric", "numeric", "numeric", "numeric"))
# Processing the datasheet to "wide data" format,
# and calculating percentage of state GDP by select industry groupings
'%!in%' <- function(x,y)!('%in%'(x,y))
library(dplyr)
library(stringr)
library(tidyr)
# Note: Need to AVERAGE quarterly estimates to match 2015 annual results
# Note: For Delaware 'Agriculture, forestry, fishing, and hunting' and
# 'Mining, quarrying, and oil and gas extraction' data are not disclosed,
# and likewise for Rhode Island (same two categories)
# "in order to avoid the disclosure of confidential information" (per BEA file);
# because the data ARE included in the 'All industry total' category for the two
# states but exact quantities cannot be determined, the missing data for each state
# is allocated equally among the two categories for the purposes of the analysis
stateGDPInd2015 <- stateGDPInd2015 %>%
mutate(state = state.abb[match(Area,state.name)],
state = if_else((is.na(state) & Area == "District of Columbia"),"DC",state),
tot2015 = (Q1_2015 + Q2_2015 + Q3_2015 + Q4_2015) / 4,
tot2015 = if_else(is.na(tot2015), 0, tot2015)) %>%
select(state, Industry, tot2015) %>%
mutate(Industry = str_remove_all(Industry, " ")) %>%
mutate(Industry = str_remove_all(Industry, ",")) %>%
select(state, Industry, tot2015) %>%
spread(key = Industry, value = tot2015) %>%
mutate(Agricultureforestryfishingandhunting =
if_else(state == "DE",1943/2,
if_else(state == "RI", 665/2,
Agricultureforestryfishingandhunting)),
Mining =
if_else(state == "DE",1943/2,
if_else(state == "RI", 665/2,
Mining)),
gdpBusProfSvc = Information + Financeandinsurance +
Realestateandrentalandleasing +
Professionalscientificandtechnicalservices +
Managementofcompaniesandenterprises +
Administrativeandwastemanagementservices,
gdpGovt = Government,
gdpHosp = Accommodationandfoodservices,
gdpManuf = Manufacturing,
gdpMining = Mining,
gdpUtil = Utilities,
gdpOth = Artsentertainmentandrecreation +
Agricultureforestryfishingandhunting +
Construction +
Educationalservices +
Healthcareandsocialassistance +
Retailtrade +
Transportationandwarehousing +
Wholesaletrade +
Otherservicesexceptgovernment,
gdpTotMn = gdpBusProfSvc + gdpGovt + gdpHosp + gdpManuf + gdpMining +
gdpUtil + gdpOth,
gdpPctBusProfSvc = gdpBusProfSvc*100 / gdpTotMn,
gdpPctGovt = gdpGovt*100 / gdpTotMn,
gdpPctHosp = gdpHosp*100 / gdpTotMn,
gdpPctManuf = gdpManuf*100 / gdpTotMn,
gdpPctMining = gdpMining*100 / gdpTotMn,
gdpPctUtil = gdpUtil*100 / gdpTotMn,
gdpPctOth = gdpOth*100 / gdpTotMn) %>%
select(state, gdpTotMn, gdpPctBusProfSvc, gdpPctGovt, gdpPctHosp, gdpPctManuf,
gdpPctMining, gdpPctUtil, gdpPctOth)
# State GDP per capita
# Loading the data file - first five rows are header details, and sixth (header)
# row has a variable name that is numeric, last four rows are table notes
#library(readxl)
stateGDPPerCap2015 <- read_excel("BEA_2015_PerCap_StateGDP.xls", skip = 6,
n_max = 51, col_name = c("FIPS", "Area", "X2015"),
col_types = c("text", "text", "numeric"))
# Processing the data to "wide data" format
#library(dplyr)
#library(tidyr)
stateGDPPerCap2015 <- stateGDPPerCap2015 %>%
mutate(state = state.abb[match(Area,state.name)],
state = if_else((is.na(state) & Area == "District of Columbia"),"DC",state),
gdpPerCapK = X2015 / 1000) %>%
select(state, gdpPerCapK)
# Joining State GDP by Industry and State GDP per capita datasets
stateGDP2015 <- left_join(stateGDPPerCap2015, stateGDPInd2015, by = "state")
# Save data file
write.csv(stateGDP2015, file = "xstateGDP2015.csv", row.names = FALSE)Per capita personal income data comes from the U.S. Census Bureau’s American FactFinder site at the following link: Census Bureau Table B19301 site
I added the states (Alabama through Wyoming) using the ‘Add/Remove Geographies’ option at the upper left of the page, and then saved a CSV file.
Highest education level attained among the age 25yrs+ population by state comes from a different table of the same American FactFinder site, at the following link: Census Bureau Table S1501 site
The education data required significant manual processing:
Both sets of data were processed using the below code:
# Personal Income per capita, 2015
# Loading the data file
setwd("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/")
stateIncPerCap <- read.csv("PerCapIncState_ACS_2015_B19301.csv")
# Processing the file to extract per capita personal income in 1,000 USD
library(dplyr)
stateIncPerCap <- stateIncPerCap %>%
mutate(state = if_else(Geography == "District of Columbia","DC",
state.abb[match(Geography,state.name)]),
persIncPerCapK = Estimate / 1000) %>%
select(state, persIncPerCapK)
## Education Level among age 25+ population, 2015
# Loading the data file
stateEdu25Yrs <- read.csv("StateEdu&Povty25yrsplus_ACS_2015_S1501.csv")
# Creating summary percentage variables by education level
stateEdu25Yrs <- stateEdu25Yrs %>%
mutate(state = if_else(Geography == "District of Columbia","DC",
state.abb[match(Geography,state.name)]),
eduPctLTHS = Edu.LT.9th.pct + Edu.9th12th.pct,
eduPctHS = Edu.HSgrad.pct,
eduPctSmColAssoc = Edu.SmCollgNoDeg.pct + Edu.AssocDeg.pct,
eduPctBac = Edu.BachDeg.pct,
eduPctGradProf = Edu.GradProfDeg.pct) %>%
select(state, eduPctLTHS, eduPctHS, eduPctSmColAssoc, eduPctBac, eduPctGradProf)
# Combining the two files and saving
persIncomeEduPct <- left_join(stateIncPerCap, stateEdu25Yrs, by = "state")
write.csv(persIncomeEduPct, file = "xpersIncomeEduPct.csv", row.names = FALSE)Note: The NASA POWER data took the most time/code by far to wrangle into a usable format - the next subsection is fairly detailed (lengthy).
This data comes from the NASA POWER API at the following link: NASA POWER API site
Suffice to say this required some processing to get to the data of interest (U.S.-only metrics).
# Reading in the data
library(readxl)
global <- read_excel("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/GlobalDat_TotInsolWnd50m.xlsx")The following code ‘maps’ latitude and longitude coordinates provided in NASA POWER data to individual countries aligning with those coordinates. (found on StackExchange - credit and source at the bottom of the code chunk)
# Apply StackExchange-provided process to map Lat & Long coordinates to countries
library(sp)
library(rworldmap)
library(rworldxtra)
# The single argument to this function, points, is a data.frame in which:
# - column 1 contains the longitude in degrees
# - column 2 contains the latitude in degrees
coords2country = function(coords)
{
#countriesSP <- getMap(resolution='low')
countriesSP <- getMap(resolution='high') #you could use high res map from
# rworldxtra if you were concerned about detail
# convert our list of points to a SpatialPoints object
#setting CRS directly to that from rworldmap
coordsSP = SpatialPoints(coords, proj4string=CRS(proj4string(countriesSP)))
# use 'over' to get indices of the Polygons object containing each point
indices = over(coordsSP, countriesSP)
# return the ADMIN names of each country
indices$ADMIN
#indices$ISO3 # returns the ISO3 code
#indices$continent # returns the continent (6 continent model)
#indices$REGION # returns the continent (7 continent model)
}
# Credit to StackOverflow user Andy for the above: https://stackoverflow.com/questions/14334970/convert-latitude-and-longitude-coordinates-to-country-name-in-rNASA POWER Data gives observations from around the world (literally); the below code filters to just those observations in the United States:
# Apply coordinates --> country mapping to the global dataset,
# filtering to only coordinates in the United States
library(dplyr)
coords <- global %>% select(LON, LAT)
global <- global %>% mutate(COUNTRY = coords2country(coords))
us <- global %>% filter(COUNTRY == "United States of America")
rm(coords, global)After running the coords2country function on the global data, there are 4,463 paired observations of insolation/wind speed per longitude/latitude pairing identified as falling in the United States - quite a bit less than the global data’s 259,200 pairs!
However, to be useful for the purposes of this modeling process, I needed to identify the states these points belonged to.
The following code chunk does this for points falling in the contiguous U.S. (i.e. all but Alaska and Hawaii - those two required a different approach):
As noted in the code chunk above, this function doesn’t ‘work for’ (map latitude/longitude coordinate pairs for) Alaska and Hawaii, so another approach is needed. All told, there were 1,205 distinct latitude/longitude coordinates not mapped to states at this point (1,123 were for Alaska - it’s a big state! - and the rest were mostly the northern regions of Michigan or Wisconsin, though Hawaii was also in there, as are locations along the edge of the Atlantic/Gulf/Pacific coasts and the Great Lakes region).
The next chunk of code separates observations based on whether or not there is a mapped state for that coordinate set, and the code chunk after that processes the state-missing cases by referencing the Google Maps API:
library(dplyr)
coords <- us %>% select(LON, LAT)
us <- us %>% mutate(state = latlong2state(coords))
# As noted before, HI and AK don't work with this method, so need to apply the
# methods in the next code chunk - first saving missing 'state' obs to another dataset
# Adding an index column to preserve current row order, then separating out 'state' = NA rows for next step
us$index <- c(1:nrow(us))
usMissing <- us %>% filter(is.na(state))
usNotMissing <- us %>% na.omit()
rm(coords)#library(dplyr)
library(ggmap)
usMissingStates <- usMissing %>% select(LON,LAT)
# Google Maps API limits a user to 2,500 queries per day,
# so OK running this process on the 2,410 'state'-missing observations
usMissingStates <- usMissingStates %>%
mutate(LOC = lapply(1:nrow(usMissingStates),
function(i){revgeocode(as.numeric(usMissingStates[i,]),
output = c("address"),
messaging = FALSE,
sensor = FALSE,
override_limit = FALSE)
}))
# Credit to StackOverflow user jlhoward for the above: https://stackoverflow.com/questions/22911642/applying-revgeocode-to-a-list-of-longitude-latitude-coordinates#22919546
# To cite ggmap in publications, please use:
# D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL
# http://journal.r-project.org/archive/2013-1/kahle-wickham.pdfUnfortunately the Google Maps API lookup didn’t correctly map every single state-missing case, so I needed to export the remaining state-missing cases and update them in a CSV file (just manually referencing the Google Maps API this time), and finally read back in the ‘no longer state-missing’ observations / knit together all the observations / save the final dataset with all state values filled in.
Before exporting to a CSV and manually updating, however, I was able to take advantage of the fact that I’d just run the Google Maps API over the same coordinate sets twice (once for each of insolation/wind speed) - the below code chunk arranges observations my matching coordinate sets and then carries in the filled-in address data if one of the two sets is filled-in (and the other is not).
Note: after running the following code chunk, I manually looked up and entered the state details for cases still missing ‘address’ information, and manually coded ‘state’ data based on the ‘address’ details. For example, if the Google Maps API lookup returned 123 Main St, Springfield, IL, I coded the state as ‘illinois’ (format matches the previous code output that matched non-AK/HI coordinates to states)
# For cases where one of the two variables' entries (same Lon/Lat values)
# has been coded to a location but the other has not, assign the location
# entry to the 'currently missing' case
#library(dplyr)
# Create PAIR to indicate which LON/LAT pairings align between the two parameters
# Then subset by PARAMETER
usMissing$LOC <- as.character(usMissingStates$LOC) # LOC is originally a list element
usMissing <- usMissing %>% arrange(PARAMETER,index) %>%
mutate(PAIR = rep(1:(nrow(usMissing)/2),2))
usMissingTotInsol <- usMissing %>% filter(PARAMETER == "ALLSKY_SFC_SW_DWN")
usMissingWind50m <- usMissing %>% filter(PARAMETER == "WS50M")
# Place the LOC entries of the one subset into the other subset for both
# Then, if one entry in the LON/LAT-paired subsets has provided entry data
# but not the other, assign the coded location data to the missing LOC
usMissingTotInsol <- usMissingTotInsol %>%
mutate(LOCoth = usMissingWind50m$LOC,
LOC = ifelse((LOC == "list(address = NA)" &
LOCoth != "list(address = NA)"), LOCoth,LOC)) %>%
select(-LOCoth)
usMissingWind50m <- usMissingWind50m %>%
mutate(LOCoth = usMissingTotInsol$LOC,
LOC = ifelse((LOC == "ist(address = NA)" &
LOCoth != "list(address = NA)"), LOCoth,LOC)) %>%
select(-LOCoth)
# Recombine the subsets and save the file
usMissing <- rbind(usMissingTotInsol, usMissingWind50m) %>%
arrange(PAIR, PARAMETER)
setwd("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/")
write.csv(usMissing, file = "usMissing.csv")Fortunately, thanks to the above code, I only had to manually look up 20 distinct latitude/longitude coordinates - not bad out of a total of 4,463!
# Read back in the new-and-improved usMissing data file which now has no missing cases
usMissingNoMore <- read.csv("usMissing.csv")
#library(dplyr)
library(stringi)
usMissingNoMore <- usMissingNoMore %>% select(-c(LOC, PAIR))
us <- rbind(usNotMissing, usMissingNoMore) %>%
arrange(index) %>%
filter(state != "ontario") %>% # one observation is on the Ontario side of NY-ON border
mutate(state =
state.abb[match(stri_trans_totitle(state),state.name)])
# convert to two-letter abbrev. to match other dataAfter all observations were matched to states, I calculated some summary data by state, and then saved the output.
# Summarizing solar insolation and wind speed @ 50m by state
# Note: Washington, D.C. does not have an observation,
# So if needed take the average of the two closest observations for each value
# Springfield, VA is about 15mi SW, and Upper Marlboro, MD is about 17mi SE
# Currently NO D.C. entries have been added
# Grouping by State and Variable and Summarizing
#library(dplyr)
usInsol <- us %>%
select(-LAT, -LON, -COUNTRY, -index) %>%
filter(PARAMETER == "ALLSKY_SFC_SW_DWN") %>%
group_by(state) %>%
summarize(janMean = mean(JAN), febMean = mean(FEB), marMean = mean(MAR),
aprMean = mean(APR), mayMean = mean(MAY), junMean = mean(JUN),
julMean = mean(JUL), augMean = mean(AUG), sepMean = mean(SEP),
octMean = mean(OCT), novMean = mean(NOV), decMean = mean(DEC),
annMeanInsol = mean(ANN), n = n())
usInsolAnn <- usInsol %>% select(state, annMeanInsol)
usWind <- us %>%
select(-LAT, -LON, -COUNTRY, -index) %>%
filter(PARAMETER == "WS50M") %>%
group_by(state) %>%
summarize(janMean = mean(JAN), febMean = mean(FEB), marMean = mean(MAR),
aprMean = mean(APR), mayMean = mean(MAY), junMean = mean(JUN),
julMean = mean(JUL), augMean = mean(AUG), sepMean = mean(SEP),
octMean = mean(OCT), novMean = mean(NOV), decMean = mean(DEC),
annMeanWs50m = mean(ANN), n = n())
usWindAnn <- usWind %>% select(state, annMeanWs50m)
usInsolWind <- left_join(usInsolAnn, usWindAnn, by = "state")# Save us_insolwind data file for ongoing work
setwd("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/")
write.csv(usInsolWind, file = "xusInsolWind.csv", row.names = FALSE)Per the National Archives and Records Administration site linked below, Certificates of Ascertainment are official documents which states submit to the federal government following a presidential election. While there is not a standardized template for the certificates, they must include the following:
- names of the Electors chosen by the voters and the number of votes received
- names of all other candidates for Elector and the number of votes received
Each state’s Certificate of Ascertainment is available in PDF form at the site below: 2016 Presidential Election Certificates of Ascertainment
I collected the summarized vote counts data in a CSV file.
While not falling in 2015, it seems reasonable to argue that results from the 2016 Presidential election are a reasonable proxy for each state’s broad-scale political leanings as of 2015.
Some states’ certificates were considerably easier to tabulate than others. For example, Connecticut not only doesn’t list which candidate is being tallied among the grouped names certifying the vote count for (un)said candidate, but the numbers are written out (‘Eight hundred ninety seven thousand, five hundred twenty four’ instead of ‘897,524’). In those cases, I had to confirm against an external source which candidate carried the state. I also had to make the assumption that the top two vote-getting candidates belonged to either the Democratic or Republican Party - given our political system, however, this wasn’t much of a stretch. I also made a quick check to confirm there were any states where, say, the Libertarian Party candidate the second highest vote-getter.
** Note: All political data was processed together - code is in the State RPS section below**
Data for 2014 Midterm Election results come from the table “Margin of Victory in 2014 United States Senate Elections” at the site below: Ballotpedia results for 2014 Midterm Election site
Since the entirety of the 2014 House’s makeup consisted of representatives in either the Democratic or Republican Party (no third-party winners), I tallied the count of winning candidates by party for each state in a CSV file.
** Note: All political data was processed together - code is in the State RPS section below**
Per the NCSL page linked below, renewable portfolio standards > …require utilities to sell a specified percentage or amount of renewable electricity. The requirement can apply only to investor-owned utilities (IOUs) but many states also include municipalities and electric cooperatives (Munis and Co-ops), though their requirements are equivalent or lower.
It should be noted that RPS legislation can differ widely both in its requirement level (% or amount of renewable energy generation required) and in its time window (year by which the targeted standard must be met). In order to keep things more manageable, I classified states in broad categories:
Data for each state’s RPS status, summarizing the year(s) in which RPS legislation was passed (if any such legislation was passed) can be found at the following link: Nat’l Conference of State Legislatures RPS site
The following code was used to process the 2016 Presidential Election, 2014 Midterm Election, and State RPS data:
# 2016 Presidential Election Votes
# Loading the data file
setwd("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/")
library(readxl)
pres2016 <- read_excel("2016PresVotesByState.xlsx")
# Calculate % Democratic party votes from Total
library(dplyr)
pres2016 <- pres2016 %>% mutate(state = State,
pres2016PctDem = Votes2016Dem*100 / Votes2016Tot) %>%
select(state, pres2016PctDem)
# 2014 Congressional Election, Representative Composition
# Loading the data file
# Note: no independent or third party representatives were elected
# Note: DC has a delegate who cannot vote on the House floor
houseRep2014 <- read_excel("Counts 114th Congress.xlsx")
houseRep2014 <- houseRep2014 %>% mutate(state = State,
house2014Total =
`Dem House 114th Congress` +
`Rep House 114th Congress`,
house2014PctDem =
if_else(state == "District of Columbia", 0,
`Dem House 114th Congress`*100 / house2014Total)) %>%
select(state, house2014PctDem)
# State Renewable Portfolio Standard (RPS) status
library(stringr)
stateRPS <- read_excel("StateRPS_status2015.xlsx") %>%
mutate(state = State, rps2015 = `RPS Status 2015`) %>%
select(state, rps2015)
# Joining the three tables by State
# and then modifying state names to postal abbreviated form
library(purrr)
#library(dplyr)
poldat <- list(pres2016, houseRep2014, stateRPS)
statePolDat <- reduce(poldat, left_join, by = "state") %>%
mutate(state = if_else(state == "District of Columbia","DC",
state.abb[match(state,state.name)]))
# Save data file
write.csv(statePolDat, file = "xstatePolDat.csv", row.names = FALSE)** The following code was used to merge all of the preceding data into a coherent dataset for exploratory data analysis and model building:**
# Set the working directory and read in the individual data subsets
setwd("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/")
energy2015 <- read.csv("xenergy2015.csv")
stateGDP2015 <- read.csv("xstateGDP2015.csv")
statePolDat <- read.csv("xstatePolDat.csv")
persIncEduPct <- read.csv("xpersIncomeEduPct.csv")
usInsolWind <- read.csv("xusInsolWind.csv")
# Merge each data subset by State
library(purrr)
library(dplyr)
datSubsets <- list(energy2015, stateGDP2015, statePolDat, persIncEduPct, usInsolWind)
finalDat <- reduce(datSubsets, left_join, by = "state") %>% filter(state != "DC")
write.csv(finalDat, file = "finalDat.csv", row.names = FALSE)My finalDat dataset, which contains all variables gathered from the various sources noted in the previous section, contains 46 variables. The below plots are a selected subset of univariate plots which were of particular interest within the scope of the analysis.
The overall EDA process followed the below process:
summary() output of the entire dataset - checking for unexpected negative/very large/missing values, as well as developing a general familiarity with the data.cor() function.corrplot package.# Reading in the data and setting levels for RPS2015 variable
setwd("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/")
library(dplyr)
finalDat <- read.csv("finalDat.csv")
## Create stateName var (U.S. full state name in lowercase) to work with ggplot2 maps
## Order levels of rps2015
## Combine some educPct_ variables
finalDat <- finalDat %>%
mutate(stateName = tolower(state.name[match(state,state.abb)]),
rps2015 = factor(rps2015, levels = c("No", "Voluntary", "Yes")),
eduPctHSOrLess = eduPctLTHS + eduPctHS,
eduPctBacGradProf = eduPctBac + eduPctGradProf)## Warning: package 'bindrcpp' was built under R version 3.4.4
The following code generates histogram plots for five variables of interest:
eConsIndex - This was considered as a potential outcome variable of interest. It is calculated as the difference between eConsPctRE (renewable energy as a % of total energy consumed) and eConsPctFF (fossil fuel energy as a % of total energy consumed). Both of these variables are converted from decimal values, e.g. 50% is ‘50’ rather than ‘0.50’ in the dataset.eConsPctFF - This was chosen as the outcome variable of interest for the model-building process.netEBalance - This is calculated as the difference between total energy production and total energy consumption for each state, measured in trillions of British Thermal Units. Therefore, positive values indicate the state is a net energy exporter, and negative values indicate the state is a net importer.totECons - This is the total 2015 energy consumption for each state, measured in tnBTU.hydroECons - This is the 2015 energy consumption from hydroelectricity for each state, in tnBTU. While solar and wind seem to be the most commonly referenced sources of renewable energy, hydropower was the largest source of overall renewable electricity by far - though its consumption was concentrated in a few states.# Univariate distribution of two outcome variables explored, plus netEBalance and totECons
library(ggplot2)
univPlot1 <- ggplot(finalDat, aes(x = 0.5, y = eConsIndex)) +
geom_boxplot(col = color_set6[6], width = 0.6, outlier.size = 2) +
labs(title = "% Renewable Energy minus % Fossil Fuel \n Consumption in 2015",
x = "", y = "%pt diff. betw. Renewable and Fossil Fuel Energy Consumption") +
theme_ds2() +
theme(axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(),
axis.line.y = element_blank()) +
scale_x_discrete(limits = c(0,1)) +
scale_y_continuous(limits = c(-100,0)) +
coord_flip()
univPlot2 <- ggplot(finalDat, aes(x = eConsPctFF)) +
geom_histogram(col = "white", fill = color_set6[5], bins = 20) +
labs(title = "Fossil Fuel Energy Consumption \n as % of Total Consumption in 2015",
x = "% Fossil Fuel Energy Consumption of Total", y = "Count") +
theme_ds2() +
theme(plot.margin = unit(rep(0.5,4), "cm"))
# Manually adjusting fill for next plot's bars
# had to manually calculate counts for each color based on an earlier plot version with same # of bins
fillCols <- c(rep(color_set6[1],7), "darkgrey", rep(color_set6[2],12))
univPlot3 <- ggplot(finalDat, aes(x = netEBalance)) +
geom_histogram(col = "white", fill = fillCols, bins = 20) +
geom_text(x = -4000, y = 7, color = color_set6[1],
label = paste("Net importers:",sum(finalDat$netEBalance <0))) +
geom_text(x = 4000, y = 7, color = color_set6[2],
label = paste("Net exporters:",sum(finalDat$netEBalance >0))) +
scale_x_continuous(breaks = seq(-5000,10000,2500)) +
labs(title = "Net Energy Balance in 2015",
x = "Production minus Consumption, tnBTUs", y = "Count",
subtitle = "Positive values indicate energy exports",
caption = "Note: No states were exactly 0") +
theme_ds2()
univPlot4 <- ggplot(finalDat, aes(x = totECons)) +
geom_histogram(col = "white", fill = color_set6[6], bins = 20) +
scale_x_continuous(breaks = seq(0,15000,2500)) +
labs(title = "Total Energy Consumption in 2015",
x = "tnBTUs", y = "Count") +
theme_ds2()
univPlot5 <- ggplot(finalDat, aes(x = hydroECons)) +
geom_histogram(col = "white", fill = color_set6[4], bins = 20) +
labs(title = "Hydroelectricity Consumption in 2015",
x = "tnBTUs", y = "Count") +
theme_ds2()univPlot1As shown in the above plot, no states consumed more renewable energy than fossil fuel energy in 2015, and for most states renewable consumption was considerably less than one-third that of fossil fuel consumption. The outlier states on the far left are Washington (-18.3) and Oregon (-23.2).
univPlot2Here we see almost the reverse of the previous plot. The vast majority of states derived more than 2/3 of their 2015 total energy consumption from fossil fuels.
univPlot3More than three times as many states were net energy importers as were net exporters in 2015. The top two net exporter states (Wyoming and Texas) both have larger positive values than the negative value of the top net importer state (California). The sum of netEBalance is negative, which I believe reflects energy imports from abroad (e.g. imported oil, and possibly hydroelectricity from Canada).
univPlot4The distribution of total energy consumption in 2015 is strongly right skewed, with Texas reigning supreme at 12,848 tnBTUs. California comes in second at 6,825 tnBTUs. Texas is notably rich in oil and natural gas, which is why it was second in net energy exports despite such high consumption behavior.
univPlot5In similar fashion to total energy consumption, albeit at a much smaller absolute scale, 2015 hydroelectricity consumption is highly right skewed. Washington leads by a mile with 684 tnBTUs consumed, followed distantly by Oregon (281) and New York (242). Hydroelectricity makes up 49% of all renewable energy consumed in 2015, followed by wind (37.5%).
While essentially all variables were plotted using the U.S. map (with state shapes having fill colors proportionate to that state’s value on the fill-color metric), the following plots focus on some variables included in the final model.
Note: The plot shown in the presentation has the caption note “Presentation version”; additional versions, modified based on post-presentation feedback, are included at the beginning of the appendix to show the data in a slightly different way.
# Creating ridge plots to detail energy consumption by specific sources across types
# PRESENTATION VERSION - see Appendix for additional versions
# Including only Total Renewables among Total (source) variables
# Total Overall and Total Fossil Fuels are on drastically broader scales than the other data
library(dplyr)
library(tidyr)
# Create lookup table to convert energy consumption descriptors
lut <- c("gthrmECons" = "Geothermal", "hydroECons" = "Hydroelectric", "solarECons" = "Solar",
"windECons" = "Wind", "rETotCons" = "Total Renewables", "biomassECons" = "Biomass",
"nclrECons" = "Nuclear", "coalECons" = "Coal", "natGasECons" = "Natural Gas",
"petroECons" = "Petroleum")
# Rearrange the data to work with ggplot2
eConsDat <- finalDat %>%
select(state, gthrmECons, hydroECons, solarECons, windECons, rETotCons,
biomassECons, nclrECons, coalECons, natGasECons, petroECons) %>%
gather(key = "eConSource", value = tnBTU, -state) %>%
mutate(eConSource = factor(lut[eConSource],
levels = c("Total Renewables", "Geothermal", "Hydroelectric", "Solar",
"Wind", "Biomass", "Nuclear", "Coal", "Natural Gas", "Petroleum")),
eType = factor(if_else(eConSource %in%
c("Total Renewables", "Geothermal", "Hydroelectric",
"Solar", "Wind"), "renewable",
if_else(eConSource %in%
c("Coal", "Natural Gas",
"Petroleum"), "fossil fuel",
"other")),
levels = c("other", "fossil fuel", "renewable")))
# Adding in state population in thousands - needs to be a separate mutate() step after eConsDat dataset is created
eConsDat <- eConsDat %>%
mutate(statePopK = rep(finalDat$statePopK, nrow(eConsDat) / n_distinct(state)))
# Vector to set desired order of density ridges
ridgeOrder <- c("Wind", "Solar", "Hydroelectric", "Geothermal", "Total Renewables",
"Nuclear", "Biomass", "Petroleum", "Natural Gas", "Coal")
# Vector to set colors by category - same as "full" version
fillCols <- c("other" = color_set6[6], "fossil fuel" = color_set6[5],
"renewable" = color_set6[4])
# Function to make Total category labels bold
# Modified from StackResource user hrbrmster from the below link:
# https://stackoverflow.com/questions/39694490/highlighting-individual-axis-labels-in-bold-using-ggplot2#39695859
embolden <- function(src, boldThis) {
if (!is.factor(src)) src <- factor(src) # make sure it's a factor
src_levels <- levels(src) # retrieve the levels in their order
brave <- boldThis %in% src_levels # make sure everything we want bold in factor levels
if (all(brave)) { # if so
b_pos <- purrr::map_int(boldThis, ~which(.==src_levels)) # then find out where they are
b_vec <- rep("plain", length(src_levels)) # make'm all plain first
b_vec[b_pos] <- "bold" # make our targets bold
b_vec # return the new vector
} else {
stop("All elements of 'boldThis' must be in src")
}
}
# Create boldList and fullList for the plot set
boldList <- ("Total Renewables")
fullList <- factor(ridgeOrder, levels = ridgeOrder)
library(ggplot2)
library(ggridges)
eConsPlot1 <- ggplot(eConsDat, aes(x = tnBTU, y = eConSource, fill = eType)) +
geom_density_ridges(scale = 1, rel_min_height = 0.01) +
scale_x_continuous(breaks = seq(0,7000,1000)) +
scale_y_discrete(limits = ridgeOrder) +
scale_fill_manual(values = fillCols, guide = FALSE) +
labs(title = "Density Ridge Plot of Energy Consumption \n by Category - Per-State",
x = "tnBTUs", y = "Energy Source", caption = "Presentation version") +
theme_ds2() +
theme(panel.grid.major.y = element_line(color = "lightgrey", linetype = 1)) +
theme(axis.text.y=element_text(face=embolden(fullList, boldList)))
eConsPlot2 <- ggplot(eConsDat, aes(y = tnBTU, x = eConSource, col = eType)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(0,7000,1000)) +
scale_x_discrete(limits = ridgeOrder) +
scale_color_manual(values = fillCols, guide = FALSE) +
labs(title = "Boxplots of Energy Consumption \n by Category - Per-State",
y = "tnBTUs", x = "Energy Source", caption = "Presentation version") +
theme_ds2() +
theme(axis.title.y = element_blank(), axis.text.y = element_blank()) +
coord_flip()
library(cowplot)
plot_grid(eConsPlot1, eConsPlot2, ncol = 2, align = "h", rel_widths = c(1.15,0.8))Not only was this plot fun to figure out (how to make), but it does a decent job of showing the sizable differences in energy consumption for 2015 by energy source, using either density ridge curves (left) or boxplots (right). It also makes it clear that all renewables combined are a much smaller component of U.S. energy consumption than any of the individual fossil fuel sources.
## Plot US states by education levels and per capita GDP / income
#library(ggplot2)
library(fiftystater)## Warning: package 'fiftystater' was built under R version 3.4.4
# Create some default plot settings to reduce lengthy code chunks
map_plot_defaults <- list(expand_limits(x = fifty_states$long, y = fifty_states$lat),
coord_map(),
scale_x_continuous(breaks = NULL),
scale_y_continuous(breaks = NULL),
theme_ds2())
### Fossil Fuel Energy Consumption as % of Total Energy Consumption
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eConsPctFF/100), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[5],
labels = scales::percent) +
labs(title = "Fossil Fuel Energy Consumption \n as % of Total Consumption",
x = "", y = "") This is a geographical representation of the outcome variable of interest - it can be seen that there are some ‘clusters’ of especially fossil fuel consumption-intensive states in the lower Midwest, the Southwest, and the Mountain West.
#library(ggplot2)
#library(fiftystater)
# Create some default plot settings to reduce lengthy code chunks
map_plot_defaults <- list(expand_limits(x = fifty_states$long, y = fifty_states$lat),
coord_map(),
scale_x_continuous(breaks = NULL),
scale_y_continuous(breaks = NULL),
theme_ds2())
### Mean Annual Solar Insolation
insolPlot <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = annMeanInsol), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = expression(paste("kW/m",""^2," per day")),
low = color_set6[1], high = color_set6[4]) +
labs(title = expression(paste("Mean Annual Solar Insolation (kW/m",""^2," per day)")),
x = "", y = "")
### Mean Annual Wind Speed at 50m off the ground
windSpdPlot <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = annMeanWs50m), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "mph",
low = color_set6[1], high = color_set6[4]) +
labs(title = expression(paste("Mean Annual Wind Speed at Height 50m (mph)")),
x = "", y = "")
## Map plot of Renewable Energy Portfolio Standards (RPS)
fill_colors <- c("No" = blues_set5[5], "Voluntary" = blues_set5[3], "Yes" = blues_set5[2])
rpsPlot <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = rps2015), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_manual(values = fill_colors) +
labs(title = expression(paste("2015 Renewable Portfolio Standards (RPS) by State")),
x = "", y = "")insolPlotThe above plot uses the processed NASA data, which shows there is a clear gradation in mean annual solar potential, which is is greater for states closer to the Equator - as could be expected.
windSpdPlotThe plot of mean annual wind speed, however, shows a distinct ‘wind corridor’ in the center of the country, as well as a smaller region of higher mean annual wind speeds around the Chesapeake Bay.
rpsPlotThe above plot shows that ‘renewable portfolio standard’ legislation was not equally likely to have been in effect as of 2015 across the country. The deep south/Appalachia and central mountain west regions were more likely to not have legislation requiring some target percentage/quantity of renewable energy as of a target date (percentages/quantities and target dates varying by state).
After considering these and related plots, I decided to remove some variables from the final dataset based on their redundancy:
# Removing the following vars as not connecting strongly to any underlying theory about energy consumption:
# gdpPctGov, gdpPctHosp, gdpPctUtil, and gdpPctOth
# Keeping "HS or Less" and "Bachelor's or More" variables and removing the following:
# eduPctLTHS, eduPctHS, eduPctBac, eduPctGradProf, eduPctSmColAssoc
# Removing house2014PctDem (pres2016PctDem covers political metrics) and stateName (state covers this)
# Removing gthrmECons (miniscule values for all but California and to a lesser extent Nevada)
#library(dplyr)
finalDatV2 <- finalDat %>%
select(-c(gdpPctGovt, gdpPctHosp, gdpPctUtil, gdpPctOth,
eduPctLTHS, eduPctHS, eduPctBac, eduPctGradProf, eduPctSmColAssoc,
house2014PctDem, stateName, gthrmECons))options(show.signif.stars = FALSE)netEBalance; if need be, removing an [energy source]ECons or _EProd variable# Preliminary analyses - full model summary, VIF, basic model diagnostics
# Dependent Variable: % Renewal Energy Consumption minus % Fossil Fuel Energy Consumption
# I want to keep in netEBalance if possible, so removing petroECons
#library(dplyr)
prelimMod1Dat <- finalDatV2 %>%
select(-c(rETotCons, ffETotCons, ffETotProd:totEProd, eConsPctRE, eConsPctFF))
prelimMod1 <- lm(eConsIndex ~ . -state -petroECons -natGasECons, data = prelimMod1Dat)
summary(prelimMod1)##
## Call:
## lm(formula = eConsIndex ~ . - state - petroECons - natGasECons,
## data = prelimMod1Dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.5649 -2.4559 -0.4202 3.2039 17.7496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.905e+00 6.090e+01 -0.097 0.923565
## hydroECons 8.808e-02 2.047e-02 4.303 0.000244
## solarECons 4.005e-02 1.160e-01 0.345 0.732912
## windECons 9.909e-02 7.616e-02 1.301 0.205584
## biomassECons 2.503e-02 9.622e-02 0.260 0.797003
## nclrECons 2.848e-02 8.982e-03 3.170 0.004125
## coalECons -3.231e-02 1.126e-02 -2.868 0.008466
## biomassEProd 9.747e-03 6.312e-02 0.154 0.878576
## coalEProd -1.769e-03 4.837e-03 -0.366 0.717734
## natGasEProd -4.137e-03 3.803e-03 -1.088 0.287405
## petroEProd -2.016e-03 4.585e-03 -0.440 0.664126
## netEBalance 5.335e-03 4.211e-03 1.267 0.217365
## statePopK 3.008e-03 2.680e-03 1.122 0.272810
## gdpPerCapK 1.071e-01 7.163e-01 0.150 0.882368
## gdpTotMn -4.760e-05 3.774e-05 -1.261 0.219302
## gdpPctBusProfSvc -1.065e+00 6.084e-01 -1.751 0.092732
## gdpPctManuf -3.892e-01 6.082e-01 -0.640 0.528312
## gdpPctMining -2.435e+00 9.254e-01 -2.631 0.014640
## pres2016PctDem -7.232e-01 4.110e-01 -1.760 0.091171
## rps2015Voluntary 4.826e+00 5.999e+00 0.805 0.428967
## rps2015Yes 1.143e+01 7.953e+00 1.437 0.163591
## persIncPerCapK -5.678e-02 1.517e+00 -0.037 0.970459
## annMeanInsol -4.054e+00 4.107e+00 -0.987 0.333382
## annMeanWs50m -9.215e-01 2.728e+00 -0.338 0.738440
## eduPctHSOrLess 1.002e-01 8.453e-01 0.119 0.906630
## eduPctBacGradProf 5.218e-01 1.347e+00 0.387 0.701948
##
## Residual standard error: 9.382 on 24 degrees of freedom
## Multiple R-squared: 0.857, Adjusted R-squared: 0.7081
## F-statistic: 5.755 on 25 and 24 DF, p-value: 2.778e-05
library(car)
vif(prelimMod1)## GVIF Df GVIF^(1/(2*Df))
## hydroECons 2.707798 1 1.645539
## solarECons 7.017605 1 2.649076
## windECons 14.865116 1 3.855531
## biomassECons 28.182837 1 5.308751
## nclrECons 2.202829 1 1.484193
## coalECons 6.997151 1 2.645213
## biomassEProd 20.487969 1 4.526364
## coalEProd 13.317774 1 3.649353
## natGasEProd 20.341909 1 4.510201
## petroEProd 13.798228 1 3.714597
## netEBalance 43.356673 1 6.584578
## statePopK 207.827329 1 14.416218
## gdpPerCapK 23.542847 1 4.852097
## gdpTotMn 160.572493 1 12.671720
## gdpPctBusProfSvc 12.173113 1 3.488999
## gdpPctManuf 6.609711 1 2.570936
## gdpPctMining 11.896951 1 3.449196
## pres2016PctDem 10.169585 1 3.188979
## rps2015 11.377292 2 1.836579
## persIncPerCapK 21.583212 1 4.645774
## annMeanInsol 3.019958 1 1.737803
## annMeanWs50m 3.845421 1 1.960975
## eduPctHSOrLess 10.880796 1 3.298605
## eduPctBacGradProf 24.667824 1 4.966671
# Very high VIF values for quite a few variables
# Dropped natGasECons on second go 'round (to avoid aliased coefficients)
# Model Diagnostics
#library(ggplot2)
library(ggfortify)
autoplot(prelimMod1, which = c(1,2)) +
labs(caption = "prelimMod1") +
theme_ds2()While eConsIndex (% of state renewable energy consumption minus % fossil fuel consumption) was considered as a target outcome of interest, it was dropped from consideration due to the clear violations of constant variance and normal errors as shown in the full model - the next candidate outcome variable performed significantly better here.
# Preliminary analyses - full model summary, VIF, basicmodel diagnostics
# Dependent Variable: % Fossil Fuel Energy Consumption
# I want to keep in netEBalance if possible, so removing petroECons
#library(dplyr)
prelimMod2Dat <- finalDatV2 %>%
select(-c(rETotCons, ffETotCons, ffETotProd:eConsIndex, eConsPctRE))
prelimMod2 <- lm(eConsPctFF ~ . -state -petroECons -natGasECons, data = prelimMod2Dat)
summary(prelimMod2)##
## Call:
## lm(formula = eConsPctFF ~ . - state - petroECons - natGasECons,
## data = prelimMod2Dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.394 -2.082 -0.133 2.648 14.748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 4.114e+01 1.436 0.16391
## hydroECons -4.337e-02 1.383e-02 -3.137 0.00448
## solarECons -2.683e-02 7.837e-02 -0.342 0.73508
## windECons -2.840e-02 5.145e-02 -0.552 0.58608
## biomassECons -3.102e-02 6.500e-02 -0.477 0.63759
## nclrECons -3.098e-02 6.068e-03 -5.105 3.18e-05
## coalECons 2.200e-02 7.609e-03 2.891 0.00803
## biomassEProd -1.567e-02 4.264e-02 -0.367 0.71654
## coalEProd 3.165e-03 3.268e-03 0.968 0.34249
## natGasEProd 4.193e-03 2.569e-03 1.632 0.11568
## petroEProd -2.728e-04 3.098e-03 -0.088 0.93056
## netEBalance -5.375e-03 2.845e-03 -1.889 0.07100
## statePopK -1.394e-03 1.811e-03 -0.770 0.44873
## gdpPerCapK 3.826e-01 4.839e-01 0.791 0.43690
## gdpTotMn 2.077e-05 2.549e-05 0.815 0.42318
## gdpPctBusProfSvc 3.927e-01 4.110e-01 0.955 0.34886
## gdpPctManuf -1.140e-01 4.109e-01 -0.277 0.78389
## gdpPctMining 1.397e+00 6.251e-01 2.234 0.03506
## pres2016PctDem 4.760e-01 2.776e-01 1.715 0.09930
## rps2015Voluntary -1.873e+00 4.053e+00 -0.462 0.64811
## rps2015Yes -5.371e+00 5.373e+00 -1.000 0.32748
## persIncPerCapK -1.128e+00 1.025e+00 -1.100 0.28228
## annMeanInsol 1.810e+00 2.775e+00 0.652 0.52040
## annMeanWs50m 7.936e-02 1.843e+00 0.043 0.96601
## eduPctHSOrLess -7.478e-02 5.711e-01 -0.131 0.89692
## eduPctBacGradProf -3.824e-02 9.102e-01 -0.042 0.96684
##
## Residual standard error: 6.338 on 24 degrees of freedom
## Multiple R-squared: 0.8537, Adjusted R-squared: 0.7014
## F-statistic: 5.603 on 25 and 24 DF, p-value: 3.508e-05
library(car)
vif(prelimMod2)## GVIF Df GVIF^(1/(2*Df))
## hydroECons 2.707798 1 1.645539
## solarECons 7.017605 1 2.649076
## windECons 14.865116 1 3.855531
## biomassECons 28.182837 1 5.308751
## nclrECons 2.202829 1 1.484193
## coalECons 6.997151 1 2.645213
## biomassEProd 20.487969 1 4.526364
## coalEProd 13.317774 1 3.649353
## natGasEProd 20.341909 1 4.510201
## petroEProd 13.798228 1 3.714597
## netEBalance 43.356673 1 6.584578
## statePopK 207.827329 1 14.416218
## gdpPerCapK 23.542847 1 4.852097
## gdpTotMn 160.572493 1 12.671720
## gdpPctBusProfSvc 12.173113 1 3.488999
## gdpPctManuf 6.609711 1 2.570936
## gdpPctMining 11.896951 1 3.449196
## pres2016PctDem 10.169585 1 3.188979
## rps2015 11.377292 2 1.836579
## persIncPerCapK 21.583212 1 4.645774
## annMeanInsol 3.019958 1 1.737803
## annMeanWs50m 3.845421 1 1.960975
## eduPctHSOrLess 10.880796 1 3.298605
## eduPctBacGradProf 24.667824 1 4.966671
# High VIF values for: quite a few, especially statePopK and gdpTotMn - the two are very likely interrelated
# Dropped natGasECons on second go 'round
# Model Diagnostics
#library(ggplot2)
#library(ggfortify)
autoplot(prelimMod2, which = c(1,2)) +
labs(caption = "prelimMod2") +
theme_ds2()Here, the residual plot and Q-Q plot each look considerably better - eConsPctFF was chosen as the target outcome variable for further model analysis.
Two general approaches were taken for generating candidate models:
step() function variable selection (stepwise, forward selection, and backward elimination)leaps package# Model selection based on AIC
#library(dplyr)
prelimMod2Dat <- finalDatV2 %>%
select(-c(rETotCons, ffETotCons, ffETotProd:eConsIndex, eConsPctRE,
petroECons, natGasECons))
emptyMod2 <- lm(eConsPctFF ~ 1, data = prelimMod2Dat)
fullMod2 <- lm(eConsPctFF ~ . -state, data = prelimMod2Dat)
stepMod2a <- step(emptyMod2, scope = formula(fullMod2), direction = "both", trace = 0)
fwdMod2a <- step(emptyMod2, scope = formula(fullMod2), direction = "forward", trace = 0)
bwdMod2a <- step(fullMod2, direction = "backward", trace = 0)
# Comparison of stepwise, forward selection, and backward elimination AIC models
summary(stepMod2a)
extractAIC(stepMod2a)
summary(fwdMod2a)
extractAIC(fwdMod2a)
summary(bwdMod2a)
extractAIC(bwdMod2a)After adding statePopK to the dataset (recommended among other post-presentation feedback), stepwise selection and backward elimination models have converged to the same model, which was not the case in the pre-presentation analysis. The forward selection model remains distinct from these others, yet it actually includes statePopK while the other model does not.
Note: During the ‘updated’ analysis, I discovered the ‘new’ stepMod2a is actually the exact same model as adjR2Mod4 (including the problematic VIF value for netEBalance which is optimally resolved by removing coalEProd). While stepMod2a is retained in the multicollinearity check section for illustrative purposes, it was removed from further consideration for the rest of the analysis since it would resolve to adjR2Mod4 anyway.
Additionally, as part of the post-presentation analysis, I also evaluated using the distinct energy consumption/production categories (windECons, etc.) or the broader catgories (rETotCons for total renewable consumption, etc.) - for all three of the step()-based models, using the distinct categories resulted in models with higher R2adjs and lower AICs.
The stepwise selection/backward elimination model has a higher R2adj value and a lower AIC value compared to the forward selection model. Since the stepwise selection and backaward elimination models are the same, only stepMod2a will be carried forward, and bwdMod2a is dropped.
# Model selection based on BIC
#library(dplyr)
n <- nrow(prelimMod2Dat)
stepMod2b <- step(emptyMod2, scope = formula(fullMod2), direction = "both", trace = 0, k = log(n))
fwdMod2b <- step(emptyMod2, scope = formula(fullMod2), direction = "forward", trace = 0, k = log(n))
bwdMod2b <- step(fullMod2, direction = "backward", trace = 0, k = log(n))
# Comparison of stepwise, forward selection, and backward elimination models
summary(stepMod2b)##
## Call:
## lm(formula = eConsPctFF ~ hydroECons + gdpPctMining, data = prelimMod2Dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8916 -6.0817 -0.7561 6.3778 15.7729
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.80508 1.64218 48.60 < 2e-16
## hydroECons -0.04490 0.01233 -3.64 0.000676
## gdpPctMining 0.96313 0.26604 3.62 0.000718
##
## Residual standard error: 9.208 on 47 degrees of freedom
## Multiple R-squared: 0.3954, Adjusted R-squared: 0.3697
## F-statistic: 15.37 on 2 and 47 DF, p-value: 7.323e-06
extractAIC(stepMod2b)## [1] 3.000 224.918
summary(fwdMod2b)##
## Call:
## lm(formula = eConsPctFF ~ hydroECons + gdpPctMining, data = prelimMod2Dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8916 -6.0817 -0.7561 6.3778 15.7729
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.80508 1.64218 48.60 < 2e-16
## hydroECons -0.04490 0.01233 -3.64 0.000676
## gdpPctMining 0.96313 0.26604 3.62 0.000718
##
## Residual standard error: 9.208 on 47 degrees of freedom
## Multiple R-squared: 0.3954, Adjusted R-squared: 0.3697
## F-statistic: 15.37 on 2 and 47 DF, p-value: 7.323e-06
extractAIC(fwdMod2b)## [1] 3.000 224.918
summary(bwdMod2b)##
## Call:
## lm(formula = eConsPctFF ~ hydroECons + nclrECons + coalECons +
## biomassEProd + coalEProd + natGasEProd + netEBalance + statePopK +
## gdpPerCapK + gdpPctManuf + gdpPctMining + pres2016PctDem +
## persIncPerCapK, data = prelimMod2Dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6776 -2.1410 -0.4088 2.5854 15.0825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.1147109 8.1816051 9.792 1.09e-11
## hydroECons -0.0471516 0.0082951 -5.684 1.84e-06
## nclrECons -0.0289995 0.0048435 -5.987 7.23e-07
## coalECons 0.0178390 0.0044869 3.976 0.000324
## biomassEProd -0.0429318 0.0106048 -4.048 0.000262
## coalEProd 0.0061673 0.0018850 3.272 0.002362
## natGasEProd 0.0053323 0.0017956 2.970 0.005280
## netEBalance -0.0075440 0.0018117 -4.164 0.000186
## statePopK -0.0005839 0.0002757 -2.118 0.041184
## gdpPerCapK 0.8035273 0.1998441 4.021 0.000284
## gdpPctManuf -0.5063853 0.2229169 -2.272 0.029185
## gdpPctMining 0.7139197 0.3205504 2.227 0.032279
## pres2016PctDem 0.3017226 0.1405469 2.147 0.038622
## persIncPerCapK -1.6237786 0.4686589 -3.465 0.001389
##
## Residual standard error: 5.58 on 36 degrees of freedom
## Multiple R-squared: 0.83, Adjusted R-squared: 0.7686
## F-statistic: 13.52 on 13 and 36 DF, p-value: 3.675e-10
extractAIC(bwdMod2b)## [1] 14.0000 183.4871
Using BIC selection, the stepwise and forward selection models both converged on a very simple model - and a very limited one. Given the incredibly low R2adj value of this model, this model will not be considered for further analysis.
While in pre-presentation analysis the backward elimination BIC model was the same as the backward elimination AIC model, here it is slightly different (the AIC model uses gdpPctBusProfSvc and rps2015, while the BIC model uses statePopK). Therefore the backward elimination BIC model will be considered further.
None of the following candidate models, generated using the leaps package and optimizing for R2adj, changed during the post-presentation expanded analysis. Newly added variable statePopK was not featured in these models, though it was included in lower-R2adj models.
# Model selection based on adjusted R-squared
library(leaps)
leapsMod <- regsubsets(eConsPctFF ~ . -state, data = prelimMod2Dat, nbest = 2, nvmax = 100)
# Plot to identify the top candidate models - choosing top 5 only (just for convenience)
plot(leapsMod, scale = "adjr2", col = rev(blues_set5), main = "Candidate Models \n Optimizing for Adjusted R-squared")# Top 5 candidate models are:
adjR2Mod1 <- lm(eConsPctFF ~ hydroECons + windECons + biomassECons + nclrECons + coalECons + coalEProd + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK, data = finalDatV2)
adjR2Mod2 <- lm(eConsPctFF ~ hydroECons + windECons + biomassECons + nclrECons + coalECons + coalEProd + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK + annMeanInsol, data = finalDatV2)
adjR2Mod3 <- lm(eConsPctFF ~ hydroECons + windECons + biomassECons + nclrECons + coalECons + coalEProd + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK + gdpPctManuf, data = finalDatV2)
adjR2Mod4 <- lm(eConsPctFF ~ hydroECons + biomassECons + nclrECons + coalECons + coalEProd + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK, data = finalDatV2)
adjR2Mod5 <- lm(eConsPctFF ~ hydroECons + biomassECons + nclrECons + coalECons + coalEProd + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK +annMeanInsol, data = finalDatV2)
rm(leapsMod)
# VIF check for each candidate model
#library(car)
#vif(adjR2Mod1) netEBalance = 12.2
#vif(adjR2Mod2) netEBalance = 12.2
#vif(adjR2Mod3) netEBalance = 13.2
#vif(adjR2Mod4) netEBalance = 12.1
#vif(adjR2Mod5) netEBalance = 12.2
# Since each has problematic VIF results with netEBalance,
# I looked into just removing that variable, but it results in significantly lower adjusted R-squared values
# Instead, I looked into removing another variable (checking one by one) from the top 5 candidate models,
# then re-checked the adjusted R-squared (goal: highest) and VIF (goal: all <10) results
# the following are the NEW top 5 candidate models:
# Removing coalEProd from adjR2Mod1
adjR2Mod1 <- lm(eConsPctFF ~ hydroECons + windECons + biomassECons + nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK, data = finalDatV2)
summary(adjR2Mod1)##
## Call:
## lm(formula = eConsPctFF ~ hydroECons + windECons + biomassECons +
## nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK +
## gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 +
## persIncPerCapK, data = finalDatV2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.211 -3.480 0.469 3.003 13.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.6190841 8.2411747 6.749 8.08e-08
## hydroECons -0.0389079 0.0083564 -4.656 4.51e-05
## windECons -0.0317182 0.0236528 -1.341 0.18856
## biomassECons -0.0517788 0.0197021 -2.628 0.01266
## nclrECons -0.0290222 0.0051005 -5.690 1.98e-06
## coalECons 0.0232862 0.0041047 5.673 2.08e-06
## natGasEProd 0.0025489 0.0011645 2.189 0.03538
## netEBalance -0.0030621 0.0009525 -3.215 0.00280
## gdpPerCapK 0.3716909 0.2190247 1.697 0.09857
## gdpPctBusProfSvc 0.5639404 0.2139974 2.635 0.01244
## gdpPctMining 1.5794883 0.3546658 4.453 8.26e-05
## pres2016PctDem 0.5907795 0.1849875 3.194 0.00297
## rps2015Voluntary -2.9014216 2.9531208 -0.982 0.33260
## rps2015Yes -7.2608336 3.3682877 -2.156 0.03807
## persIncPerCapK -1.2366064 0.4731624 -2.613 0.01312
##
## Residual standard error: 5.737 on 35 degrees of freedom
## Multiple R-squared: 0.8253, Adjusted R-squared: 0.7554
## F-statistic: 11.81 on 14 and 35 DF, p-value: 2.184e-09
extractAIC(adjR2Mod1)## [1] 15.0000 186.8549
# Removing coalEProd from adjR2Mod2
adjR2Mod2 <- lm(eConsPctFF ~ hydroECons + windECons + biomassECons + nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK + annMeanInsol, data = finalDatV2)
summary(adjR2Mod2)##
## Call:
## lm(formula = eConsPctFF ~ hydroECons + windECons + biomassECons +
## nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK +
## gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 +
## persIncPerCapK + annMeanInsol, data = finalDatV2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0642 -2.8895 0.2634 3.0247 14.2237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.2409037 12.8216871 3.528 0.001221
## hydroECons -0.0376778 0.0084239 -4.473 8.20e-05
## windECons -0.0366325 0.0240691 -1.522 0.137263
## biomassECons -0.0511388 0.0196795 -2.599 0.013743
## nclrECons -0.0286481 0.0051045 -5.612 2.75e-06
## coalECons 0.0237288 0.0041194 5.760 1.76e-06
## natGasEProd 0.0026056 0.0011639 2.239 0.031826
## netEBalance -0.0032360 0.0009651 -3.353 0.001972
## gdpPerCapK 0.3863070 0.2191075 1.763 0.086872
## gdpPctBusProfSvc 0.5126141 0.2191155 2.339 0.025323
## gdpPctMining 1.5608987 0.3545283 4.403 0.000101
## pres2016PctDem 0.5117596 0.1992873 2.568 0.014797
## rps2015Voluntary -3.1664531 2.9590054 -1.070 0.292109
## rps2015Yes -6.5976857 3.4210235 -1.929 0.062168
## persIncPerCapK -1.0194068 0.5152777 -1.978 0.056036
## annMeanInsol 1.9744997 1.8708827 1.055 0.298692
##
## Residual standard error: 5.727 on 34 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.7562
## F-statistic: 11.13 on 15 and 34 DF, p-value: 4.671e-09
extractAIC(adjR2Mod2)## [1] 16.0000 187.2432
# Removing coalEProd from adjR2Mod3
adjR2Mod3 <- lm(eConsPctFF ~ hydroECons + windECons + biomassECons + nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK + gdpPctManuf, data = finalDatV2)
summary(adjR2Mod3)##
## Call:
## lm(formula = eConsPctFF ~ hydroECons + windECons + biomassECons +
## nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK +
## gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 +
## persIncPerCapK + gdpPctManuf, data = finalDatV2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2332 -3.4872 0.4816 3.0273 13.1874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.3984396 11.1333997 4.976 1.85e-05
## hydroECons -0.0389415 0.0085519 -4.554 6.46e-05
## windECons -0.0316346 0.0241590 -1.309 0.199169
## biomassECons -0.0518574 0.0201603 -2.572 0.014645
## nclrECons -0.0290122 0.0051858 -5.595 2.90e-06
## coalECons 0.0232403 0.0044367 5.238 8.43e-06
## natGasEProd 0.0025453 0.0011875 2.143 0.039324
## netEBalance -0.0030579 0.0009764 -3.132 0.003563
## gdpPerCapK 0.3694544 0.2343796 1.576 0.124215
## gdpPctBusProfSvc 0.5682329 0.2599883 2.186 0.035828
## gdpPctMining 1.5866492 0.4317445 3.675 0.000813
## pres2016PctDem 0.5917347 0.1903647 3.108 0.003788
## rps2015Voluntary -2.9105038 3.0114356 -0.966 0.340627
## rps2015Yes -7.2775364 3.4624289 -2.102 0.043049
## persIncPerCapK -1.2343917 0.4857014 -2.541 0.015768
## gdpPctManuf 0.0076042 0.2533482 0.030 0.976231
##
## Residual standard error: 5.82 on 34 degrees of freedom
## Multiple R-squared: 0.8253, Adjusted R-squared: 0.7482
## F-statistic: 10.71 on 15 and 34 DF, p-value: 7.755e-09
extractAIC(adjR2Mod3)## [1] 16.0000 188.8536
# Removing coalEProd from adjR2Mod4
adjR2Mod4 <- lm(eConsPctFF ~ hydroECons + biomassECons + nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK, data = finalDatV2)
summary(adjR2Mod4)##
## Call:
## lm(formula = eConsPctFF ~ hydroECons + biomassECons + nclrECons +
## coalECons + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc +
## gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK,
## data = finalDatV2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2747 -3.5801 0.3998 3.1516 12.8097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.2751114 8.3280071 6.637 9.87e-08
## hydroECons -0.0397386 0.0084253 -4.717 3.56e-05
## biomassECons -0.0633592 0.0179040 -3.539 0.00113
## nclrECons -0.0275283 0.0050322 -5.470 3.55e-06
## coalECons 0.0233836 0.0041493 5.636 2.13e-06
## natGasEProd 0.0017029 0.0009896 1.721 0.09389
## netEBalance -0.0030699 0.0009629 -3.188 0.00296
## gdpPerCapK 0.2794871 0.2102453 1.329 0.19210
## gdpPctBusProfSvc 0.6412776 0.2083521 3.078 0.00397
## gdpPctMining 1.6506919 0.3545352 4.656 4.28e-05
## pres2016PctDem 0.6138643 0.1862156 3.297 0.00221
## rps2015Voluntary -4.2720867 2.8011416 -1.525 0.13597
## rps2015Yes -9.1126276 3.1060806 -2.934 0.00579
## persIncPerCapK -1.1494146 0.4738414 -2.426 0.02042
##
## Residual standard error: 5.8 on 36 degrees of freedom
## Multiple R-squared: 0.8163, Adjusted R-squared: 0.7499
## F-statistic: 12.3 on 13 and 36 DF, p-value: 1.36e-09
extractAIC(adjR2Mod4)## [1] 14.00 187.36
# Removing coalEProd from adjR2Mod5
adjR2Mod5 <- lm(eConsPctFF ~ hydroECons + biomassECons + nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK +annMeanInsol, data = finalDatV2)
summary(adjR2Mod5)##
## Call:
## lm(formula = eConsPctFF ~ hydroECons + biomassECons + nclrECons +
## coalECons + natGasEProd + netEBalance + gdpPerCapK + gdpPctBusProfSvc +
## gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK +
## annMeanInsol, data = finalDatV2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1756 -3.4080 0.6389 2.8868 13.5052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.7538904 12.9518346 3.687 0.000764
## hydroECons -0.0389445 0.0085388 -4.561 6.00e-05
## biomassECons -0.0641914 0.0180425 -3.558 0.001098
## nclrECons -0.0270916 0.0050942 -5.318 6.12e-06
## coalECons 0.0237136 0.0041961 5.651 2.23e-06
## natGasEProd 0.0016493 0.0009980 1.653 0.107335
## netEBalance -0.0031961 0.0009827 -3.252 0.002536
## gdpPerCapK 0.2797254 0.2114836 1.323 0.194519
## gdpPctBusProfSvc 0.6129101 0.2128647 2.879 0.006755
## gdpPctMining 1.6452427 0.3566946 4.612 5.14e-05
## pres2016PctDem 0.5594688 0.2004734 2.791 0.008457
## rps2015Voluntary -4.6162949 2.8536708 -1.618 0.114712
## rps2015Yes -8.8413540 3.1446186 -2.812 0.008024
## persIncPerCapK -0.9830709 0.5243147 -1.875 0.069162
## annMeanInsol 1.4236386 1.8697387 0.761 0.451512
##
## Residual standard error: 5.834 on 35 degrees of freedom
## Multiple R-squared: 0.8193, Adjusted R-squared: 0.747
## F-statistic: 11.33 on 14 and 35 DF, p-value: 3.781e-09
extractAIC(adjR2Mod5)## [1] 15.0000 188.5386
Candidate models and the sections that followed changed as a result of post-presentation feedback updates. The revised stepwise selection model (AIC) was eliminated in the VIF check, and revised forward selection (AIC) and backward elimination (BIC) models were included in later analyses.
The fwdMod2a and stepMod2a models (applying forward selection and stepwise selection, respectively) were eliminated from further consideration in the multicollinearity check in the next section; for the remaining models, it’s worth noting the remaining models have many independent variables in common, with very similar coefficient values.
Note: The original forms of bwdMod2b and fwdMod2a included statePopK, but it was removed as a result of checking VIF values. The original forms of both models are still considered in the Multicollinearity Check section, but the updated versions are in the coefficient summary table below.
#updating bwdMod2b and fwdMod2a based on VIF check in next section
bwdMod2b <- update(bwdMod2b, .~. -statePopK)
fwdMod2a <- update(fwdMod2a, .~. -statePopK)
# Creating a summary table of model coefficients for candidate models
#library(dplyr)
library(broom)
library(reshape2)
library(tidyr)
# Organize candidate models and extract/organize model coefficients
modelsList <- list(fwdMod2a=fwdMod2a, bwdMod2b=bwdMod2b, adjR2Mod1=adjR2Mod1, adjR2Mod2=adjR2Mod2, adjR2Mod3=adjR2Mod3, adjR2Mod4=adjR2Mod4, adjR2Mod5=adjR2Mod5)
adjR2List <- sapply(modelsList, summary)["adj.r.squared",]
aicList <- sapply(modelsList, extractAIC)
# Kludge to include adjusted R-squared and AIC data for each model
addThis <- data.frame(term = c(rep("adjR2", 7), rep("nTerms", 7), rep("AIC", 7)),
variable = rep("estimate", 21),
value = c(unlist(adjR2List), unlist(aicList)[1,], unlist(aicList)[2,]),
L1 = rep(names(modelsList), 3))
# Create an aesthetically solid table
library(knitr)
library(kableExtra)
coefSummary <- modelsList %>%
lapply(tidy, .id = "model") %>%
melt() %>%
rbind(.,addThis) %>%
filter(variable == "estimate") %>%
select(term, value, L1) %>%
mutate(coefEst = round(value, 3)) %>%
mutate(term = cell_spec(term, format = "html",
bold = if_else(term %in% c("adjR2", "nTerms", "AIC"), TRUE, FALSE))) %>%
select(L1, term, coefEst) %>%
spread(key = L1, value = coefEst) %>%
kable(format = "html", escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "bordered"))
coefSummary| term | adjR2Mod1 | adjR2Mod2 | adjR2Mod3 | adjR2Mod4 | adjR2Mod5 | bwdMod2b | fwdMod2a |
|---|---|---|---|---|---|---|---|
| (Intercept) | 55.619 | 45.241 | 55.398 | 55.275 | 47.754 | 80.593 | 50.277 |
| annMeanInsol | NA | 1.974 | NA | NA | 1.424 | NA | 1.140 |
| biomassECons | -0.052 | -0.051 | -0.052 | -0.063 | -0.064 | NA | -0.059 |
| biomassEProd | NA | NA | NA | NA | NA | -0.045 | NA |
| coalECons | 0.023 | 0.024 | 0.023 | 0.023 | 0.024 | 0.017 | 0.024 |
| coalEProd | NA | NA | NA | NA | NA | 0.004 | NA |
| gdpPctBusProfSvc | 0.564 | 0.513 | 0.568 | 0.641 | 0.613 | NA | 0.600 |
| gdpPctManuf | NA | NA | 0.008 | NA | NA | -0.320 | NA |
| gdpPctMining | 1.579 | 1.561 | 1.587 | 1.651 | 1.645 | 0.666 | 1.695 |
| gdpPerCapK | 0.372 | 0.386 | 0.369 | 0.279 | 0.280 | 0.698 | NA |
| gdpTotMn | NA | NA | NA | NA | NA | NA | 0.000 |
| hydroECons | -0.039 | -0.038 | -0.039 | -0.040 | -0.039 | -0.051 | -0.037 |
| natGasEProd | 0.003 | 0.003 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002 |
| nclrECons | -0.029 | -0.029 | -0.029 | -0.028 | -0.027 | -0.028 | -0.030 |
| netEBalance | -0.003 | -0.003 | -0.003 | -0.003 | -0.003 | -0.004 | -0.003 |
| persIncPerCapK | -1.237 | -1.019 | -1.234 | -1.149 | -0.983 | -1.493 | -0.504 |
| pres2016PctDem | 0.591 | 0.512 | 0.592 | 0.614 | 0.559 | 0.281 | 0.490 |
| rps2015Voluntary | -2.901 | -3.166 | -2.911 | -4.272 | -4.616 | NA | -2.908 |
| rps2015Yes | -7.261 | -6.598 | -7.278 | -9.113 | -8.841 | NA | -7.277 |
| windECons | -0.032 | -0.037 | -0.032 | NA | NA | NA | -0.028 |
| adjR2 | 0.755 | 0.756 | 0.748 | 0.750 | 0.747 | 0.747 | 0.745 |
| AIC | 186.855 | 187.243 | 188.854 | 187.360 | 188.539 | 187.357 | 189.548 |
| nTerms | 15.000 | 16.000 | 16.000 | 14.000 | 15.000 | 13.000 | 16.000 |
Checking VIF values and checking into the feasibility of resolving high VIF values (removing X variables) while retaining adequate R2adj and AIC values was the first check that resulted in trimming the number of candidate models.
For this reason, this section is listed before the check of constant residual scatter and normality of errors.
# 'Revising' bwdMod2b and fwdMod2a to their original forms for the VIF check
bwdMod2b <- update(bwdMod2b, .~. +statePopK)
fwdMod2a <- update(fwdMod2a, .~. +statePopK)
library(car)
# Generate the VIF summaries
adjR2Mod1VIF <- vif(adjR2Mod1)
adjR2Mod2VIF <- vif(adjR2Mod2)
adjR2Mod3VIF <- vif(adjR2Mod3)
adjR2Mod4VIF <- vif(adjR2Mod4)
adjR2Mod5VIF <- vif(adjR2Mod5)
stepMod2aVIF <- vif(stepMod2a)
fwdMod2aVIF <- vif(fwdMod2a)
bwdMod2bVIF <- vif(bwdMod2b)
# Create a VIF summary table with conditionnal formatting to call out problematic VIF values
# Note: stepMod2, fwdMod2, and bwdMod2 VIF output if formatted differently - needed to work around this
vifList <- list(adjR2Mod1=adjR2Mod1VIF, adjR2Mod2=adjR2Mod2VIF, adjR2Mod3=adjR2Mod3VIF, adjR2Mod4=adjR2Mod4VIF, adjR2Mod5=adjR2Mod5VIF, stepMod2a=stepMod2aVIF, fwdMod2a=fwdMod2aVIF, bwdMod2b=bwdMod2bVIF)
options(max.print = 999999)
library(dplyr)
library(broom)
library(reshape2)
library(tidyr)
library(knitr)
library(kableExtra)
vifSummary <- vifList %>%
lapply(tidy) %>%
melt() %>%
filter((variable == "GVIF") | (L1 == "bwdMod2b")) %>%
#mutate(variable = .rownames) %>%
mutate(variable = if_else(L1 == "bwdMod2b", names, .rownames)) %>%
select(variable, value, L1) %>%
mutate(vif = round(value, 2)) %>%
select(L1, variable, vif) %>%
mutate(vif = cell_spec(vif, format = "html", color = if_else(vif >= 10, "red", "black"),
bold = if_else(vif >= 10, TRUE, FALSE))) %>%
spread(key = L1, value = vif) %>%
kable(format = "html", escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "bordered"))
vifSummary| variable | adjR2Mod1 | adjR2Mod2 | adjR2Mod3 | adjR2Mod4 | adjR2Mod5 | bwdMod2b | fwdMod2a | stepMod2a |
|---|---|---|---|---|---|---|---|---|
| annMeanInsol | NA | 1.68 | NA | NA | 1.62 | NA | 2.04 | NA |
| biomassECons | 3.16 | 3.16 | 3.21 | 2.55 | 2.56 | NA | 4.21 | 2.8 |
| biomassEProd | NA | NA | NA | NA | NA | 1.63 | NA | NA |
| coalECons | 2.49 | 2.51 | 2.82 | 2.48 | 2.51 | 3.14 | 3.14 | 3.3 |
| coalEProd | NA | NA | NA | NA | NA | 5.72 | NA | 4.26 |
| gdpPctBusProfSvc | 4.03 | 4.24 | 5.78 | 3.74 | 3.85 | NA | 3.82 | 3.95 |
| gdpPctManuf | NA | NA | 2.98 | NA | NA | 2.51 | NA | NA |
| gdpPctMining | 4.67 | 4.69 | 6.73 | 4.57 | 4.57 | 4.04 | 4.47 | 4.9 |
| gdpPerCapK | 5.89 | 5.91 | 6.55 | 5.31 | 5.31 | 5.18 | NA | 5.39 |
| gdpTotMn | NA | NA | NA | NA | NA | NA | 74.75 | NA |
| hydroECons | 1.21 | 1.23 | 1.23 | 1.2 | 1.22 | 1.26 | 1.25 | 1.22 |
| natGasEProd | 5.1 | 5.11 | 5.15 | 3.61 | 3.62 | 12.82 | 6.16 | 6.59 |
| nclrECons | 1.9 | 1.91 | 1.91 | 1.81 | 1.83 | 1.81 | 1.92 | 1.85 |
| netEBalance | 5.93 | 6.11 | 6.06 | 5.93 | 6.11 | 22.69 | 8.75 | 12.08 |
| persIncPerCapK | 5.61 | 6.68 | 5.75 | 5.51 | 6.66 | 5.82 | 4.02 | 5.51 |
| pres2016PctDem | 5.51 | 6.42 | 5.67 | 5.46 | 6.26 | 3.36 | 6.44 | 5.61 |
| rps2015 | 4.2 | 4.52 | 4.32 | 3.38 | 3.59 | NA | 4.44 | 3.72 |
| statePopK | NA | NA | NA | NA | NA | 6.22 | 90.93 | NA |
| windECons | 3.83 | 3.98 | 3.89 | NA | NA | NA | 3.98 | NA |
As the above table shows, there are problematic VIF values for the bwdMod2a, bwdMod2b, fwdMod2a, and stepMod2a.
Removing statePopK from bwdMod2b resolves the high VIF values for natGasEProd (lowers to 4.7) and netEBalance (lowers to 7.6). Since this update only lowers the the R2adj from 0.768 to 0.747, an updated bwdMod2b will be retained for further analysis.
Resolving problematic VIF values for fwdMod2a is very similar - removing statePopK from the model reduces the VIF value of gdpTotMn (lowers to 4.4) and the highest VIF value is then netEBalance (now 7.9), while maintaining a reasonably high R2adj value (now 0.745, from 0.758). Again, the updated fwdMod2a will be retained.
The high VIF issue for stepMod2a is optimally resolved (keeping maximum R2adj and minimum AIC) by just removing coalEProd. Since this actually just recreated adjR2Mod4, stepMod2a is dropped from further consideration.
rm(stepMod2a)
# Resolving high VIF values for bwdMod2b
bwdMod2b <- update(bwdMod2b, .~. -statePopK)
fwdMod2a <- update(fwdMod2a, .~. -statePopK)# Model Diagnostics for each candidate model
#library(ggplot2)
#library(ggfortify)
autoplot(adjR2Mod1, which = c(1,2)) +
geom_point(col = color_set6[1]) +
labs(caption = "adjR2Mod1") +
theme_ds2()autoplot(adjR2Mod2, which = c(1,2)) +
geom_point(col = color_set6[2]) +
labs(caption = "adjR2Mod2") +
theme_ds2()autoplot(adjR2Mod3, which = c(1,2)) +
geom_point(col = color_set6[3]) +
labs(caption = "adjR2Mod3") +
theme_ds2()autoplot(adjR2Mod4, which = c(1,2)) +
geom_point(col = color_set6[4]) +
labs(caption = "adjR2Mod4") +
theme_ds2()autoplot(adjR2Mod5, which = c(1,2)) +
geom_point(col = color_set6[5]) +
labs(caption = "adjR2Mod5") +
theme_ds2()autoplot(fwdMod2a, which = c(1,2)) +
geom_point(col = color_set6[6]) +
labs(caption = "fwdMod2a") +
theme_ds2()autoplot(bwdMod2b, which = c(1,2)) +
geom_point(col = "darkgrey") +
labs(caption = "bwdMod2b") +
theme_ds2()For the most part the plots look similar (as would be expected given the similarity in composition of the models), but there are some differences:
adjR2Mod1, adjR2Mod3, and adjR2Mod4 have slightly more ‘diamond-shaped’ residual plot scatter (potentially an issue with observations towards the center of the fitted values range), while adjR2Mod2 and adjR2Mod5 have slightly more ‘box-shaped’ scatter. All models other than bwdMod2b and perhaps adjR2Mod2 have pronounced concentration of scatter for higher-value fitted observations.adjR2Mod1, adjR2Mod3, and fwdMod2a have good-looking Q-Q plots, while the other three adjR2Mod[#] models have distinct patterns in portions of their plotted points (e.g. around Theoretical Quantiles = -1 and = 1), and bwdMod2b has highly pronounced skewness in the tails of its Q-Q Plot.All together, fwdMod2a looks like the best candidate at this point. However, it may be possible to improve one or several of the models by transforming Y (Box-Cox) and/or some of the X variables.
eConsPctFF (Y)library(MASS)
boxcox(adjR2Mod1, lambda = seq(-2,5,0.5))boxcox(adjR2Mod2, lambda = seq(-2,5,0.5))boxcox(adjR2Mod3, lambda = seq(-2,5,0.5))boxcox(adjR2Mod4, lambda = seq(-2,5,0.5))boxcox(adjR2Mod5, lambda = seq(-2,5,0.5))boxcox(fwdMod2a, lambda = seq(-2,5,0.5))boxcox(bwdMod2b, lambda = seq(-2,5,0.5))Interestingly, other than adjR2Mod3, each of the adjR2Mod[#] models and bwdMod2b have 95% confidence intervals centered on a ‘suggested’ λ value between 2 (quadratic transform) and 3 (cubic transform), though 1 (no transformation) is also included (again, except for adjR2Mod3 and perhaps adjR2Mod1).
Therefore, for each candidate model, a quadratic transform of eConsPctFF will be considered in terms of residual plots, Q-Q plots, VIF values, and partial regression (‘added variable’) plots. Along the way, some models may be discarded from further consideration if they are deemed to be significantly inferior to other candidate models.
Note: As a result of post-presentation feedback, cubic transformation was considered in addition to quadratic transformation.
After considering each of the still-eligible candidate models with transformed and non-transformed Y, a final model will be selected for further work.
# Creating a quadratic-transform version of eConsPctFF
#library(dplyr)
finalDatV2 <- finalDatV2 %>%
mutate(eConsPctFF2 = eConsPctFF^2,
eConsPctFF3 = eConsPctFF^3)
# Creating candidate models using the quadratic- and cubic-transformed Y
adjR2Mod1Y2 <- update(adjR2Mod1, -eConsPctFF +eConsPctFF2 ~ .)
adjR2Mod1Y3 <- update(adjR2Mod1, -eConsPctFF +eConsPctFF3 ~ .)
adjR2Mod2Y2 <- update(adjR2Mod2, -eConsPctFF +eConsPctFF2 ~ .)
adjR2Mod2Y3 <- update(adjR2Mod2, -eConsPctFF +eConsPctFF3 ~ .)
adjR2Mod3Y2 <- update(adjR2Mod3, -eConsPctFF +eConsPctFF2 ~ .)
adjR2Mod3Y3 <- update(adjR2Mod3, -eConsPctFF +eConsPctFF3 ~ .)
adjR2Mod4Y2 <- update(adjR2Mod4, -eConsPctFF +eConsPctFF2 ~ .)
adjR2Mod4Y3 <- update(adjR2Mod4, -eConsPctFF +eConsPctFF3 ~ .)
adjR2Mod5Y2 <- update(adjR2Mod5, -eConsPctFF +eConsPctFF2 ~ .)
adjR2Mod5Y3 <- update(adjR2Mod5, -eConsPctFF +eConsPctFF3 ~ .)
fwdMod2aY2 <- update(fwdMod2a, -eConsPctFF +eConsPctFF2 ~ . , data = finalDatV2)
fwdMod2aY3 <- update(fwdMod2a, -eConsPctFF +eConsPctFF3 ~ . , data = finalDatV2)
bwdMod2bY2 <- update(bwdMod2b, -eConsPctFF +eConsPctFF2 ~ . , data = finalDatV2)
bwdMod2bY3 <- update(bwdMod2b, -eConsPctFF +eConsPctFF3 ~ . , data = finalDatV2)
# Evaluating residual plots and Q-Q plots for the non-transformed Y and quadratic-transformed Y models
#library(ggplot2)
#library(ggfortify)
autoplot(adjR2Mod1, which = c(1,2)) +
geom_point(col = color_set6[1]) +
labs(caption = "adjR2Mod1") +
theme_ds2()autoplot(adjR2Mod1Y2, which = c(1,2)) +
geom_point(col = color_set6[1]) +
labs(caption = "adjR2Mod1Y2") +
theme_ds2()autoplot(adjR2Mod1Y3, which = c(1,2)) +
geom_point(col = color_set6[1]) +
labs(caption = "adjR2Mod1Y3") +
theme_ds2()autoplot(adjR2Mod2, which = c(1,2)) +
geom_point(col = color_set6[2]) +
labs(caption = "adjR2Mod2") +
theme_ds2()autoplot(adjR2Mod2Y2, which = c(1,2)) +
geom_point(col = color_set6[2]) +
labs(caption = "adjR2Mod2Y2") +
theme_ds2()autoplot(adjR2Mod2Y3, which = c(1,2)) +
geom_point(col = color_set6[2]) +
labs(caption = "adjR2Mod2Y3") +
theme_ds2()autoplot(adjR2Mod3, which = c(1,2)) +
geom_point(col = color_set6[3]) +
labs(caption = "adjR2Mod3") +
theme_ds2()autoplot(adjR2Mod3Y2, which = c(1,2)) +
geom_point(col = color_set6[3]) +
labs(caption = "adjR2Mod3Y2") +
theme_ds2()autoplot(adjR2Mod3Y3, which = c(1,2)) +
geom_point(col = color_set6[3]) +
labs(caption = "adjR2Mod3Y3") +
theme_ds2()autoplot(adjR2Mod4, which = c(1,2)) +
geom_point(col = color_set6[4]) +
labs(caption = "adjR2Mod4") +
theme_ds2()autoplot(adjR2Mod4Y2, which = c(1,2)) +
geom_point(col = color_set6[4]) +
labs(caption = "adjR2Mod4Y2") +
theme_ds2()autoplot(adjR2Mod4Y3, which = c(1,2)) +
geom_point(col = color_set6[4]) +
labs(caption = "adjR2Mod4Y3") +
theme_ds2()autoplot(adjR2Mod5, which = c(1,2)) +
geom_point(col = color_set6[5]) +
labs(caption = "adjR2Mod5") +
theme_ds2()autoplot(adjR2Mod5Y2, which = c(1,2)) +
geom_point(col = color_set6[5]) +
labs(caption = "adjR2Mod5Y2") +
theme_ds2()autoplot(adjR2Mod5Y3, which = c(1,2)) +
geom_point(col = color_set6[5]) +
labs(caption = "adjR2Mod5Y3") +
theme_ds2()autoplot(fwdMod2a, which = c(1,2)) +
geom_point(col = color_set6[6]) +
labs(caption = "fwdMod2a") +
theme_ds2()autoplot(fwdMod2aY2, which = c(1,2)) +
geom_point(col = color_set6[6]) +
labs(caption = "fwdMod2aY2") +
theme_ds2()autoplot(fwdMod2aY3, which = c(1,2)) +
geom_point(col = color_set6[6]) +
labs(caption = "fwdMod2aY3") +
theme_ds2()autoplot(bwdMod2b, which = c(1,2)) +
geom_point(col = "darkgrey") +
labs(caption = "bwdMod2b") +
theme_ds2()autoplot(bwdMod2bY2, which = c(1,2)) +
geom_point(col = "darkgrey") +
labs(caption = "bwdMod2bY2") +
theme_ds2()autoplot(bwdMod2bY3, which = c(1,2)) +
geom_point(col = "darkgrey") +
labs(caption = "bwdMod2bY3") +
theme_ds2()The quadratic transformation noticeably improves the pattern of residual plot scatter among the adjR2Mod[#] and fwdMod2a models above their counterparts non-quadratic Y, while only marginally affecting bwdMod2b. Cubic transformation further improves the residual scatter for each model except for bwdMod2b, where the slight funneling of left-side observations (quadratic transformation) is exaggerated (cubic transformation).
adjR2Mod1Y2, adjR2Mod3Y2, adjR2Mod4Y2, and adjR2Mod5Y2 (versions with quadratic-transformed Y, or eConsPctFF2) still have distinct narrowing scatter on the right side.
As for the Q-Q plots:
adjR2Mod[#] models. Non-transformed Y models tend to have some sequential patterning of observation points around the Theoretical Quantile 1, while cubic transformation exaggerates skewness in the tails.fwdMod2a seems optimal with no transformation; quadratic and cubic transformations on Y exaggerate skewness around Theoretical Quantile -1.bwdMod2b has problematic skewness across all three considered forms of Y.In each case, observations 30 (New Hampshire), 39 (Rhode Island), and 44 (Utah) are noted - potentially these are outliers, especially Utah.
Based on the above, models each adjR2Mod1 version, adjR2Mod2 and adjR2Mod2Y3, each adjR2Mod3 version, each adjR2Mod4 version, each adjR2Mod5 version, each fwdMod2a version, and each bwdMod2b version are all removed from further consideration.
This leavesadjR2Mod2Y2 as the best candidate for the final model.
Note: This was also the case in the initial/pre-presentation feedback analysis.
library(car)
# Generating the VIF summary
vif(adjR2Mod2Y2)## GVIF Df GVIF^(1/(2*Df))
## hydroECons 1.230803 1 1.109416
## windECons 3.983478 1 1.995865
## biomassECons 3.163347 1 1.778580
## nclrECons 1.908865 1 1.381617
## coalECons 2.511400 1 1.584740
## natGasEProd 5.113630 1 2.261334
## netEBalance 6.110015 1 2.471844
## gdpPerCapK 5.910371 1 2.431125
## gdpPctBusProfSvc 4.237176 1 2.058440
## gdpPctMining 4.685705 1 2.164649
## pres2016PctDem 6.416996 1 2.533179
## rps2015 4.521379 2 1.458202
## persIncPerCapK 6.677684 1 2.584121
## annMeanInsol 1.681609 1 1.296769
VIF is still good for adjR2Mod2Y2. Now it is time to check the model’s partial regression (added-variable) plots to determine if transformations on X are appropriate.
library(MASS)
avPlots(adjR2Mod2Y2, ask = FALSE, col = color_set6[2], pch = 16, col.lines = "blue", lwd = 1.5,
grid = FALSE, main = "Partial Regression Plots: adjR2Mod2Y2")The majority of added-variable plots above suggest each variable has a reasonably linear association with eConsPctFF2, except for perhaps natGasEProd, which has a large cloud of points centered around zero and a few distant points at either side.
# Need to recode '0's to miniscule positive value to avoid negative values for the boxTidwell() function
finalDatV3 <- finalDatV2 %>%
mutate(hydroECons = if_else(hydroECons == 0, 1e-99, hydroECons),
windECons = if_else(windECons == 0, 1e-99, windECons),
biomassECons = if_else(biomassECons == 0, 1e-99, biomassECons),
nclrECons = if_else(nclrECons == 0, 1e-99, nclrECons),
coalECons = if_else(coalECons == 0, 1e-99, coalECons),
natGasEProd = if_else(natGasEProd == 0, 1e-99, natGasEProd),
gdpPerCapK = if_else(gdpPerCapK == 0, 1e-99, gdpPerCapK),
gdpPctBusProfSvc = if_else(gdpPctBusProfSvc == 0, 1e-99, gdpPctBusProfSvc),
gdpPctMining = if_else(gdpPctMining == 0, 1e-99, gdpPctMining))
boxTidwell(eConsPctFF ~ hydroECons + biomassECons + coalECons + natGasEProd, other.x = ~ windECons + nclrECons + gdpPerCapK + gdpPctBusProfSvc + persIncPerCapK + rps2015 + gdpPctMining + pres2016PctDem + netEBalance + annMeanInsol, data = finalDatV3, max.iter = 999)## MLE of lambda Score Statistic (z) Pr(>|z|)
## hydroECons 0.3625474 1.1193 0.26300
## biomassECons -0.0032541 1.9144 0.05557
## coalECons 0.6674493 -1.3515 0.17652
## natGasEProd 0.5775109 -0.8575 0.39119
##
## iterations = 20
boxTidwell(eConsPctFF2 ~ hydroECons + windECons + biomassECons + natGasEProd, other.x = ~ coalECons + nclrECons + gdpPerCapK + gdpPctBusProfSvc + persIncPerCapK + rps2015 + gdpPctMining + pres2016PctDem + netEBalance + annMeanInsol, data = finalDatV3, max.iter = 100)## MLE of lambda Score Statistic (z) Pr(>|z|)
## hydroECons 0.281464 1.3408 0.17999
## windECons 0.505839 0.7336 0.46322
## biomassECons -0.050674 1.7487 0.08035
## natGasEProd 0.408187 -1.4790 0.13914
##
## iterations = 7
None of the added-variable plots point to a clear variable needing transformation, and the boxTidwell() function doesn’t flag any variables at the 0.05 level for transformation - biomassECons looks like a possible candidate for a logarithmic transformation, but this didn’t really improve the model’s diagnostic plots.
For the sake of demonstrating the concept, quadratic transformation of netEBalance (with centering to orthogonalize the quadratic term) will be considered.
#library(dplyr)
finalDatV2 <- finalDatV2 %>%
mutate(netEBalanceC = netEBalance - mean(netEBalance),
netEBalanceC2 = netEBalanceC^2)
adjR2Mod2Y2Quad <- update(adjR2Mod2Y2, .~. -netEBalance + netEBalanceC + netEBalanceC2)
# Confirming multicollinearity is addressed by centering
vif(adjR2Mod2Y2Quad)## GVIF Df GVIF^(1/(2*Df))
## hydroECons 1.240631 1 1.113836
## windECons 3.998299 1 1.999575
## biomassECons 4.233611 1 2.057574
## nclrECons 1.910216 1 1.382106
## coalECons 2.577881 1 1.605578
## natGasEProd 5.569841 1 2.360051
## gdpPerCapK 5.918889 1 2.432877
## gdpPctBusProfSvc 4.311169 1 2.076336
## gdpPctMining 5.209496 1 2.282432
## pres2016PctDem 6.425137 1 2.534785
## rps2015 4.670157 2 1.470053
## persIncPerCapK 6.751694 1 2.598402
## annMeanInsol 1.930024 1 1.389253
## netEBalanceC 8.893525 1 2.982201
## netEBalanceC2 3.708920 1 1.925856
summary(adjR2Mod2Y2Quad)##
## Call:
## lm(formula = eConsPctFF2 ~ hydroECons + windECons + biomassECons +
## nclrECons + coalECons + natGasEProd + gdpPerCapK + gdpPctBusProfSvc +
## gdpPctMining + pres2016PctDem + rps2015 + persIncPerCapK +
## annMeanInsol + netEBalanceC + netEBalanceC2, data = finalDatV2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1464.15 -457.53 40.44 440.09 2276.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.582e+03 2.192e+03 1.178 0.247159
## hydroECons -5.412e+00 1.301e+00 -4.160 0.000213
## windECons -5.535e+00 3.709e+00 -1.492 0.145169
## biomassECons -1.039e+01 3.502e+00 -2.968 0.005548
## nclrECons -4.728e+00 7.855e-01 -6.019 9.12e-07
## coalECons 3.675e+00 6.420e-01 5.725 2.17e-06
## natGasEProd 4.683e-01 1.869e-01 2.506 0.017306
## gdpPerCapK 5.958e+01 3.373e+01 1.767 0.086561
## gdpPctBusProfSvc 7.968e+01 3.400e+01 2.343 0.025281
## gdpPctMining 2.282e+02 5.751e+01 3.968 0.000368
## pres2016PctDem 8.777e+01 3.068e+01 2.861 0.007272
## rps2015Voluntary -3.565e+02 4.612e+02 -0.773 0.445011
## rps2015Yes -1.088e+03 5.263e+02 -2.068 0.046559
## persIncPerCapK -1.846e+02 7.970e+01 -2.316 0.026900
## annMeanInsol 1.247e+02 3.083e+02 0.404 0.688577
## netEBalanceC -6.025e-01 1.791e-01 -3.364 0.001960
## netEBalanceC2 2.465e-05 2.154e-05 1.144 0.260666
##
## Residual standard error: 881.1 on 33 degrees of freedom
## Multiple R-squared: 0.8425, Adjusted R-squared: 0.7661
## F-statistic: 11.03 on 16 and 33 DF, p-value: 5.399e-09
autoplot(adjR2Mod2Y2Quad, which = c(1,2)) +
geom_point(col = color_set6[2]) +
labs(caption = "adjR2Mod2Y2Quad") +
theme_ds2()#library(MASS)
avPlots(adjR2Mod2Y2Quad, ask = FALSE, col = color_set6[2], pch = 16, col.lines = "blue", lwd = 1.5,
grid = FALSE, main = "Partial Regression Plots: adjR2Mod2Y2Quad")While each of the diagnostics looks alright, there isn’t really any improvement to the model, and the centered quadratic term does not have a slope coefficient that is statistically significantly different from zero. Therefore, it will not be added to the model.
# All interactions deemed to "make some kind of sense" (to me)
finalDatV2 <- finalDatV2 %>%
mutate(hydroEConsXnetEBalance = hydroECons * netEBalance,
biomassEConsXnetEBalance = biomassECons * netEBalance,
nclrEConsXnetEBalance = nclrECons * netEBalance,
coalEConsXnetEBalance = coalECons * netEBalance,
natGasEProdXnetEBalance = natGasEProd * netEBalance,
gdpPctBusProfSvcXgdpPerCapK = gdpPctBusProfSvc * gdpPerCapK,
gdpPctMiningXcoalECons = gdpPctMining * coalECons,
gdpPctMiningXpres2016PctDem = gdpPctMining * pres2016PctDem,
gdpPerCapKXpersIncPerCapK = gdpPerCapK * persIncPerCapK,
annMeanInsolXwindECons = annMeanInsol * windECons)# Checked interaction terms one at a time - below includes the best performing interaction
#library(car)
adjR2Mod2Y2Int <- update(adjR2Mod2Y2, .~. +hydroEConsXnetEBalance)
summary(adjR2Mod2Y2Int)##
## Call:
## lm(formula = eConsPctFF2 ~ hydroECons + windECons + biomassECons +
## nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK +
## gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 +
## persIncPerCapK + annMeanInsol + hydroEConsXnetEBalance, data = finalDatV2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1559.48 -445.97 -24.35 330.32 2259.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.150e+03 2.038e+03 1.055 0.299154
## hydroECons -8.509e+00 2.654e+00 -3.207 0.002981
## windECons -6.084e+00 3.675e+00 -1.655 0.107327
## biomassECons -9.080e+00 3.042e+00 -2.985 0.005314
## nclrECons -4.731e+00 7.784e-01 -6.077 7.69e-07
## coalECons 3.857e+00 6.297e-01 6.126 6.67e-07
## natGasEProd 4.119e-01 1.775e-01 2.321 0.026599
## netEBalance -4.091e-01 1.577e-01 -2.595 0.014018
## gdpPerCapK 5.340e+01 3.385e+01 1.577 0.124257
## gdpPctBusProfSvc 7.139e+01 3.476e+01 2.054 0.048001
## gdpPctMining 2.236e+02 5.708e+01 3.917 0.000425
## pres2016PctDem 8.430e+01 3.042e+01 2.771 0.009113
## rps2015Voluntary -4.337e+02 4.511e+02 -0.961 0.343387
## rps2015Yes -1.075e+03 5.217e+02 -2.061 0.047256
## persIncPerCapK -1.580e+02 7.950e+01 -1.988 0.055188
## annMeanInsol 1.987e+02 2.877e+02 0.691 0.494592
## hydroEConsXnetEBalance -2.774e-03 1.995e-03 -1.391 0.173651
##
## Residual standard error: 873.1 on 33 degrees of freedom
## Multiple R-squared: 0.8453, Adjusted R-squared: 0.7703
## F-statistic: 11.27 on 16 and 33 DF, p-value: 4.098e-09
extractAIC(adjR2Mod2Y2Int)## [1] 17.000 690.435
vif(adjR2Mod2Y2Int)## GVIF Df GVIF^(1/(2*Df))
## hydroECons 5.255185 1 2.292419
## windECons 3.996403 1 1.999101
## biomassECons 3.252785 1 1.803548
## nclrECons 1.909994 1 1.382025
## coalECons 2.524720 1 1.588937
## natGasEProd 5.115577 1 2.261764
## netEBalance 7.015798 1 2.648735
## gdpPerCapK 6.070696 1 2.463878
## gdpPctBusProfSvc 4.588855 1 2.142161
## gdpPctMining 5.225221 1 2.285874
## pres2016PctDem 6.434693 1 2.536670
## rps2015 4.523491 2 1.458372
## persIncPerCapK 6.839957 1 2.615331
## annMeanInsol 1.711121 1 1.308098
## hydroEConsXnetEBalance 7.208967 1 2.684952
autoplot(adjR2Mod2Y2Int, which = c(1,2)) +
geom_point(col = color_set6[2]) +
labs(caption = "adjR2Mod2Y2Int") +
theme_ds2()#library(MASS)
avPlots(adjR2Mod2Y2Int, ask = FALSE, col = color_set6[2], pch = 16, col.lines = "blue", lwd = 1.5,
grid = FALSE, main = "Partial Regression Plots: adjR2Mod2Y2Int")The above output shows the best-performing interaction among those considered - this interaction term, while not statistically significant at level 0.05, slightly improves the model’s R2adj value (0.764 without the term, 0.770 with it) and AIC value (691.3 without, 690.4 with). It also seems to slightly improve the model’s residual plot, and doesn’t materially impact the Q-Q plot.
Therefore, I will retain the interaction term. Its coefficient can be interpreted as follows: when controlling for all other variables in the model, the slope coefficient of netEBalance is associated with an average decrease of about 0.003 for each 1-unit increase in hydroECons; both variables are measured in trillions of BTUs.
We could also state the average change in the slope coefficient of hydroECons per 1-unit increase in netEBalance, after controlling for all other variables, is about 0.003.
Since the interaction term will be retained, the final model is as shown in the summary directly above.
# Workaround to resolve issues with conflicting dplyr and MASS 'select' functions
select <- dplyr::select# Influential measures for the reduced model
#library(dplyr)
finalDatV2 <- finalDatV2 %>%
mutate(cooksD = cooks.distance(adjR2Mod2Y2Int),
dfFitS = dffits(adjR2Mod2Y2Int),
dfBetashydroECons = dfbetas(adjR2Mod2Y2Int)[,2], #col 1 is for the intercept
dfBetaswindECons = dfbetas(adjR2Mod2Y2Int)[,3],
dfBetasbiomassECons = dfbetas(adjR2Mod2Y2Int)[,4],
dfBetasnclrECons = dfbetas(adjR2Mod2Y2Int)[,5],
dfBetascoalECons = dfbetas(adjR2Mod2Y2Int)[,6],
dfBetasnatGasEProd = dfbetas(adjR2Mod2Y2Int)[,7],
dfBetasnetEBalance = dfbetas(adjR2Mod2Y2Int)[,8],
dfBetasgdpPerCapK = dfbetas(adjR2Mod2Y2Int)[,9],
dfBetasgdpPctBusProfSvc = dfbetas(adjR2Mod2Y2Int)[,10],
dfBetasgdpPctMining = dfbetas(adjR2Mod2Y2Int)[,11],
dfBetaspres2016PctDem = dfbetas(adjR2Mod2Y2Int)[,12],
dfBetasrps2015Voluntary = dfbetas(adjR2Mod2Y2Int)[,13],
dfBetasrps2015Yes = dfbetas(adjR2Mod2Y2Int)[,14],
dfBetaspersIncPerCapK = dfbetas(adjR2Mod2Y2Int)[,15],
dfBetasannMeanInsol = dfbetas(adjR2Mod2Y2Int)[,16],
dfBetashydroEConsXnetEBal = dfbetas(adjR2Mod2Y2Int)[,17],
covratio = covratio(adjR2Mod2Y2Int))
# Number of slope coefficients in the model
# used for default influential point cutoffs in DFFits and DFBetas
p <- nrow(summary(adjR2Mod2Y2Int)$coefficients) - 1 # minus 1 accounts for the slope coefficient
# Transforming the dataset to be 'ggplot friendly'
#library(tidyr)
library(stringr)
finalDatV2Trans <- finalDatV2 %>%
select(cooksD, dfFitS, dfBetashydroECons, dfBetaswindECons, dfBetasbiomassECons, dfBetasnclrECons, dfBetascoalECons, dfBetasnatGasEProd, dfBetasnetEBalance, dfBetasgdpPerCapK, dfBetasgdpPctBusProfSvc, dfBetasgdpPctMining, dfBetaspres2016PctDem, dfBetasrps2015Voluntary, dfBetasrps2015Yes, dfBetaspersIncPerCapK, dfBetasannMeanInsol, dfBetashydroEConsXnetEBal, covratio) %>%
gather(key = "metric", value = val)
#Note: seems to throw an error trying to do all this in a single pipe set
finalDatV2Trans <- finalDatV2Trans %>%
mutate(influential = if_else((metric == "cooksD")&(val > 1),"influential",
if_else((metric == "dfFitS")&(abs(val) > 2*sqrt(p/nrow(finalDatV2))),
"influential",
if_else((str_detect(metric,"dfBetas")) &
(abs(val)> 2/sqrt(nrow(finalDatV2))),"influential",
if_else((metric == "covratio") &
((val < 1 - 3*p/nrow(finalDatV2))|(val > 1 + 3*p/nrow(finalDatV2))),
"influential","not")))))#library(ggplot2)
colorCols <- c("influential" = color_set6[6], "not" = color_set6[1])
ggplot(finalDatV2Trans, aes(x = val, y = metric)) +
geom_jitter(aes(col = influential), size = 2, alpha = 0.5, width = 0, height = 0.1) +
labs(title = "Jitter Plot of Observations \n by Influence Metrics",
x = "", y = "Metric", subtitle = "'Influential' based on text-suggested defaults only", color = NULL) +
scale_color_manual(values = colorCols) +
scale_x_continuous(breaks = seq(-2, 7, 1)) +
theme_ds2() +
theme(panel.grid.major.y = element_line(color = "lightgrey", linetype = 1))While the (literal) textbook guidelines for defining influential values flag quite a few observations as ‘influential’ per the above, there are three points that are most prominent (and a fourth if including the covratio value greater than 5).
These points are the furthest-right point for dfFitS (3.07), the furthest-right point for dfBetashydroECons (1.64), and the furthest-right point for covratio (7.00). The second-to-right-most point for covratio (5.09) could be considered influential, but I will argue it is close enough to the next four points at left to fall into the ‘not’ category.
# Checking which state(s) the influential points belong to
#library(dplyr)
# State of dfFitS influential point
finalDatV2 %>% select(state, dfFitS) %>% arrange(desc(dfFitS)) %>% summarize(dfFitSState = first(state))## dfFitSState
## 1 WA
# State of dfBetashydroECons influential point
finalDatV2 %>% select(state, dfBetashydroECons) %>% arrange(desc(dfBetashydroECons)) %>% summarize(dfBhydroECState = first(state))## dfBhydroECState
## 1 WA
# State of covratio influential point
finalDatV2 %>% select(state, covratio) %>% arrange(desc(covratio)) %>% summarize(covratioState = first(state))## covratioState
## 1 TX
Two of the three points I consider influential belong to Washington, which is unsurprising - this single state accounts for nearly 30% of the entire country’s hydroelectricity consumption in 2015, which enabled it to have the lowest percentage of energy consumption from fossil fuels in 2015 (and thus the lowest value of the quadratic-transformed eConsPctFF2):
finalDatV2 %>% mutate(pctNatlHydro = round(hydroECons*100 / sum(hydroECons), 1),
eConsPctFF = round(eConsPctFF, 1)) %>%
select(state, pctNatlHydro, eConsPctFF) %>% arrange(desc(pctNatlHydro)) %>% head()## state pctNatlHydro eConsPctFF
## 1 WA 29.5 54.2
## 2 OR 12.5 57.9
## 3 NY 10.4 75.7
## 4 CA 5.5 84.6
## 5 AL 4.0 70.8
## 6 MT 4.0 77.9
It’s also not altogether surprising that Texas could be considered an influential observation - in 2015 it was the leader in both fossil fuel production and consumption by a large margin (driven by natural gas and oil production and coal, natural gas, and oil consumption, respectively). Texas also led the nation in wind energy consumption.
Running diagnostic plots with the final model after removing Washington and then Washington and Texas results in some minor changes:
Removing Washington only:
hydroECons moves from -8.51 to -1.28; the other coefficients do not change to nearly as great an extent (as might be expected)hydroECons and interaction term hydroEConsXnetEBalance decrease (5.3 to 2.9 and 7.2 to 4.4, respectively)hydroecons partial regression plot becomes much more clustered at the center, and no other plots really shift muchRemoving Texas only:
windECons moves from -6.08 to -3.97 and rps2015Voluntary moves from -4.34 to -5.44; the other coefficients do not change muchwindECons (4.0 to 2.1), coalECons (2.5 to 2.0), and natGasEProd (5.1 to 2.4)hydroecons partial regression plot becomes much more clustered at the center, and no other plots really shift muchRemoving Washington and Texas:
hydroECons moves from -8.51 to -1.26hydroECons and interaction term hydroEConsXnetEBalance again decrease and are nearly but not quite the exact same as when removing only Washington (5.3 to 2.9 and 7.2 to 4.4, respectively) - this seems reasonable because Texas consumed very little hydroelectricity (8.9 tnBTU compared to Washington’s 684.1 tnBTU)hydroecons partial regression plot’s increase clustering is the only notable shiftThis seems to support the conclusion that Washington is a moderately influential observation, but that the model performs fairly well even when it is removed.
# Sensitivity analysis, removing Washington - final model
#library(dplyr)
sensAnDatNoWA <- finalDatV2 %>% filter(state != "WA")
adjR2Mod2Y2Int1 <- update(adjR2Mod2Y2Int, .~. , data = sensAnDatNoWA)
#library(car)
summary(adjR2Mod2Y2Int1)##
## Call:
## lm(formula = eConsPctFF2 ~ hydroECons + windECons + biomassECons +
## nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK +
## gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 +
## persIncPerCapK + annMeanInsol + hydroEConsXnetEBalance, data = sensAnDatNoWA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1597.89 -368.59 16.01 363.81 2188.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.913e+03 2.044e+03 1.425 0.16378
## hydroECons -1.275e+01 3.678e+00 -3.467 0.00152
## windECons -5.898e+00 3.589e+00 -1.643 0.11012
## biomassECons -9.278e+00 2.972e+00 -3.122 0.00379
## nclrECons -4.507e+00 7.721e-01 -5.838 1.74e-06
## coalECons 3.613e+00 6.326e-01 5.711 2.51e-06
## natGasEProd 4.172e-01 1.732e-01 2.408 0.02196
## netEBalance -4.172e-01 1.540e-01 -2.710 0.01073
## gdpPerCapK 6.035e+01 3.332e+01 1.811 0.07948
## gdpPctBusProfSvc 6.919e+01 3.396e+01 2.038 0.04994
## gdpPctMining 2.175e+02 5.583e+01 3.895 0.00047
## pres2016PctDem 7.768e+01 2.997e+01 2.592 0.01427
## rps2015Voluntary -4.834e+02 4.414e+02 -1.095 0.28158
## rps2015Yes -9.855e+02 5.121e+02 -1.924 0.06326
## persIncPerCapK -1.790e+02 7.866e+01 -2.276 0.02970
## annMeanInsol 1.964e+02 2.808e+02 0.699 0.48946
## hydroEConsXnetEBalance -3.353e-03 1.979e-03 -1.694 0.09994
##
## Residual standard error: 852.2 on 32 degrees of freedom
## Multiple R-squared: 0.844, Adjusted R-squared: 0.766
## F-statistic: 10.82 on 16 and 32 DF, p-value: 9.948e-09
extractAIC(adjR2Mod2Y2Int1)## [1] 17.0000 674.4099
vif(adjR2Mod2Y2Int1)## GVIF Df GVIF^(1/(2*Df))
## hydroECons 2.870690 1 1.694311
## windECons 3.983773 1 1.995939
## biomassECons 3.245993 1 1.801664
## nclrECons 1.966840 1 1.402441
## coalECons 2.639417 1 1.624628
## natGasEProd 5.100796 1 2.258494
## netEBalance 6.996964 1 2.645178
## gdpPerCapK 6.073956 1 2.464540
## gdpPctBusProfSvc 4.566808 1 2.137009
## gdpPctMining 5.217414 1 2.284166
## pres2016PctDem 6.422095 1 2.534185
## rps2015 4.589867 2 1.463693
## persIncPerCapK 6.936413 1 2.633707
## annMeanInsol 1.690594 1 1.300228
## hydroEConsXnetEBalance 4.410494 1 2.100118
autoplot(adjR2Mod2Y2Int1, which = c(1,2)) +
geom_point(col = color_set6[2], shape = 15) +
labs(caption = "adjR2Mod2Y2Int1 - WA removed") +
theme_ds2()#library(MASS)
avPlots(adjR2Mod2Y2Int1, ask = FALSE, col = color_set6[2], pch = 15, col.lines = "blue", lwd = 1.5,
grid = FALSE, main = "Partial Regression Plots: adjR2Mod2Y2Int minus WA")# Sensitivity analysis, removing Texas - final model
sensAnDatNoTX <- finalDatV2 %>% filter(state != "TX")
adjR2Mod2Y2Int2 <- update(adjR2Mod2Y2Int, .~. , data = sensAnDatNoTX)
summary(adjR2Mod2Y2Int2)##
## Call:
## lm(formula = eConsPctFF2 ~ hydroECons + windECons + biomassECons +
## nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK +
## gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 +
## persIncPerCapK + annMeanInsol + hydroEConsXnetEBalance, data = sensAnDatNoTX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1522.1 -469.8 1.5 400.8 2266.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.063e+03 2.055e+03 1.004 0.323142
## hydroECons -8.560e+00 2.672e+00 -3.203 0.003072
## windECons -3.970e+00 4.658e+00 -0.852 0.400405
## biomassECons -9.704e+00 3.175e+00 -3.057 0.004492
## nclrECons -4.803e+00 7.897e-01 -6.082 8.55e-07
## coalECons 3.914e+00 6.385e-01 6.130 7.45e-07
## natGasEProd 4.852e-01 2.038e-01 2.381 0.023394
## netEBalance -4.212e-01 1.596e-01 -2.640 0.012709
## gdpPerCapK 5.093e+01 3.424e+01 1.488 0.146670
## gdpPctBusProfSvc 7.292e+01 3.506e+01 2.080 0.045618
## gdpPctMining 2.176e+02 5.801e+01 3.751 0.000702
## pres2016PctDem 8.753e+01 3.093e+01 2.830 0.007981
## rps2015Voluntary -5.439e+02 4.775e+02 -1.139 0.263161
## rps2015Yes -1.204e+03 5.528e+02 -2.178 0.036873
## persIncPerCapK -1.538e+02 8.024e+01 -1.916 0.064310
## annMeanInsol 1.893e+02 2.899e+02 0.653 0.518578
## hydroEConsXnetEBalance -2.760e-03 2.008e-03 -1.374 0.178882
##
## Residual standard error: 879 on 32 degrees of freedom
## Multiple R-squared: 0.8448, Adjusted R-squared: 0.7671
## F-statistic: 10.88 on 16 and 32 DF, p-value: 9.251e-09
extractAIC(adjR2Mod2Y2Int2)## [1] 17.0000 677.4483
vif(adjR2Mod2Y2Int2)## GVIF Df GVIF^(1/(2*Df))
## hydroECons 5.245327 1 2.290268
## windECons 2.146595 1 1.465126
## biomassECons 3.284892 1 1.812427
## nclrECons 1.890054 1 1.374792
## coalECons 1.991016 1 1.411034
## natGasEProd 2.386892 1 1.544957
## netEBalance 6.052484 1 2.460180
## gdpPerCapK 6.077657 1 2.465291
## gdpPctBusProfSvc 4.564916 1 2.136566
## gdpPctMining 5.137734 1 2.266657
## pres2016PctDem 6.562241 1 2.561687
## rps2015 5.107780 2 1.503343
## persIncPerCapK 6.855858 1 2.618369
## annMeanInsol 1.660778 1 1.288712
## hydroEConsXnetEBalance 7.160088 1 2.675834
autoplot(adjR2Mod2Y2Int2, which = c(1,2)) +
geom_point(col = color_set6[2], shape = 18, size = 2.5) +
labs(caption = "adjR2Mod2Y2Int2 - TX removed") +
theme_ds2()avPlots(adjR2Mod2Y2Int2, ask = FALSE, col = color_set6[2], pch = 18, col.lines = "blue", lwd = 1.5,
grid = FALSE, main = "Partial Regression Plots: adjR2Mod2Y2Int minus TX")# Sensitivity analysis, removing Washington AND Texas - final model
sensAnDatNoWATX <- finalDatV2 %>% filter(state != "WA" & state != "TX")
adjR2Mod2Y2Int3 <- update(adjR2Mod2Y2Int, .~. , data = sensAnDatNoWATX)
summary(adjR2Mod2Y2Int3)##
## Call:
## lm(formula = eConsPctFF2 ~ hydroECons + windECons + biomassECons +
## nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK +
## gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 +
## persIncPerCapK + annMeanInsol + hydroEConsXnetEBalance, data = sensAnDatNoWATX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1561.87 -424.17 28.35 349.69 2195.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.822e+03 2.063e+03 1.368 0.181146
## hydroECons -1.276e+01 3.706e+00 -3.443 0.001667
## windECons -3.883e+00 4.549e+00 -0.854 0.399869
## biomassECons -9.871e+00 3.102e+00 -3.182 0.003315
## nclrECons -4.579e+00 7.839e-01 -5.841 1.93e-06
## coalECons 3.670e+00 6.421e-01 5.716 2.76e-06
## natGasEProd 4.870e-01 1.990e-01 2.447 0.020269
## netEBalance -4.287e-01 1.559e-01 -2.750 0.009864
## gdpPerCapK 5.793e+01 3.373e+01 1.718 0.095837
## gdpPctBusProfSvc 7.067e+01 3.427e+01 2.062 0.047656
## gdpPctMining 2.118e+02 5.677e+01 3.731 0.000766
## pres2016PctDem 8.083e+01 3.050e+01 2.650 0.012553
## rps2015Voluntary -5.880e+02 4.672e+02 -1.259 0.217555
## rps2015Yes -1.109e+03 5.432e+02 -2.042 0.049687
## persIncPerCapK -1.747e+02 7.946e+01 -2.199 0.035467
## annMeanInsol 1.874e+02 2.832e+02 0.662 0.513087
## hydroEConsXnetEBalance -3.334e-03 1.994e-03 -1.672 0.104564
##
## Residual standard error: 858.5 on 31 degrees of freedom
## Multiple R-squared: 0.8434, Adjusted R-squared: 0.7625
## F-statistic: 10.43 on 16 and 31 DF, p-value: 2.234e-08
extractAIC(adjR2Mod2Y2Int3)## [1] 17.0000 661.5129
vif(adjR2Mod2Y2Int3)## GVIF Df GVIF^(1/(2*Df))
## hydroECons 2.859289 1 1.690943
## windECons 2.105088 1 1.450892
## biomassECons 3.273207 1 1.809201
## nclrECons 1.947795 1 1.395634
## coalECons 2.080072 1 1.442245
## natGasEProd 2.375995 1 1.541426
## netEBalance 6.037777 1 2.457189
## gdpPerCapK 6.079600 1 2.465684
## gdpPctBusProfSvc 4.544554 1 2.131796
## gdpPctMining 5.129546 1 2.264850
## pres2016PctDem 6.552933 1 2.559870
## rps2015 5.180111 2 1.508637
## persIncPerCapK 6.956695 1 2.637555
## annMeanInsol 1.641521 1 1.281219
## hydroEConsXnetEBalance 4.374909 1 2.091628
autoplot(adjR2Mod2Y2Int3, which = c(1,2)) +
geom_point(col = color_set6[2], shape = 17, size = 2.5) +
labs(caption = "adjR2Mod2Y2Int3 - WA & TX removed") +
theme_ds2()avPlots(adjR2Mod2Y2Int3, ask = FALSE, col = color_set6[2], pch = 17, col.lines = "blue", lwd = 1.5,
grid = FALSE, main = "Partial Regression Plots: adjR2Mod2Y2Int minus WA & TX")The final model’s components are described in the below summary:
summary(adjR2Mod2Y2Int)##
## Call:
## lm(formula = eConsPctFF2 ~ hydroECons + windECons + biomassECons +
## nclrECons + coalECons + natGasEProd + netEBalance + gdpPerCapK +
## gdpPctBusProfSvc + gdpPctMining + pres2016PctDem + rps2015 +
## persIncPerCapK + annMeanInsol + hydroEConsXnetEBalance, data = finalDatV2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1559.48 -445.97 -24.35 330.32 2259.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.150e+03 2.038e+03 1.055 0.299154
## hydroECons -8.509e+00 2.654e+00 -3.207 0.002981
## windECons -6.084e+00 3.675e+00 -1.655 0.107327
## biomassECons -9.080e+00 3.042e+00 -2.985 0.005314
## nclrECons -4.731e+00 7.784e-01 -6.077 7.69e-07
## coalECons 3.857e+00 6.297e-01 6.126 6.67e-07
## natGasEProd 4.119e-01 1.775e-01 2.321 0.026599
## netEBalance -4.091e-01 1.577e-01 -2.595 0.014018
## gdpPerCapK 5.340e+01 3.385e+01 1.577 0.124257
## gdpPctBusProfSvc 7.139e+01 3.476e+01 2.054 0.048001
## gdpPctMining 2.236e+02 5.708e+01 3.917 0.000425
## pres2016PctDem 8.430e+01 3.042e+01 2.771 0.009113
## rps2015Voluntary -4.337e+02 4.511e+02 -0.961 0.343387
## rps2015Yes -1.075e+03 5.217e+02 -2.061 0.047256
## persIncPerCapK -1.580e+02 7.950e+01 -1.988 0.055188
## annMeanInsol 1.987e+02 2.877e+02 0.691 0.494592
## hydroEConsXnetEBalance -2.774e-03 1.995e-03 -1.391 0.173651
##
## Residual standard error: 873.1 on 33 degrees of freedom
## Multiple R-squared: 0.8453, Adjusted R-squared: 0.7703
## F-statistic: 11.27 on 16 and 33 DF, p-value: 4.098e-09
The final model can be interepreted thusly:
hydroECons (hydroelectricity consumption, tnBTUs)windECons (wind power consumption, tnBTUs)biomassECons (biomass power consumption, tnBTUs)nclrECons (nuclear power consumption, tnBTUs)netEBalance (difference between total energy production and consumption, tnBTUs - positive indicating a net surplus)rps2015 either ‘Voluntary’ or ‘Yes’ (having either voluntary or mandatory targeted renewable energy legislation)persIncPerCapK (personal per-capita income in $1,000s)hydroEConsXnetEBalance (the interaction term between hydroECons and netEBalance)coalECons (coal energy consumption, tnBTUs)natGasEProd (natural gas energy production, tnBTUs)gdpPerCapK (state gross domestic product per capita in $1,000s)gdpPctBusProfSvc (percentage of state GDP derived from ‘business or professional services’)gdpPctMining (percentage of state GDP derived from mining or mineral extraction)pres2016PctDem (percentage of the 2016 Presidential Election for the Democratic Party candidate)annMeanInsol (solar energy potential, in megawatt hours per square meter per day)Including meanAnnInsol (mean annual insolation, roughly ‘available solar energy per square meter’) in the final model is somewhat curious - while it has a high p-value, indicating it does not have a slope coefficient statistically significantly different than zero, excluding it from the model negatively impacts the residual plot - residual plots for adjR2Mod1 and adjR2Mod1Y2 show this in the section Considering Transformations on eConsPctFF (Y).
In a similar vein, windECons, gdpPerCapK and the interaction term hydroEConsXnetEBalance are not statistically significant at level 0.05, but their removal from the model negatively impacts diagnostic plots and model summary metrics.
I am not positive but suspect that this means these variables represent imperfect measures of some relevant element(s) not currently captured in the model. For example, meanAnnInsol may relate to something characteristic of southern states (which have higher meanAnnInsol values), and windECons may relate to a factor inherent to states in the center of the country (where wind energy consumption is higher).
After a lengthy analysis (and a perhaps still lengthier report write-up!), I believe this report demonstrates the utility and application of much of the diagnostic tools covered during our MAT 8406 Regression Methods course.
While it would be excellent to be able to provide a policy-oriented recommendation stemming from this analysis, I believe that this analysis and the final model therein performs best as a potential descriptive tool for better understanding the connection between various energy-source, political, and economic elements and the relative fossil-fuel intensity of the different states in 2015.
A strong underlying theory and ideally a stronger understanding of locale-specific economic/political/etc. constraints would be very helpful in being able to extend much beyond that.
Additionally, while the model holds up well overall across the various diagnostics applied, there are some remaining uncertainties to consider such as whether there may be superior data sources to some of those employed in the final model. Specifically, there are several variables which are not considered statistically significant at relatively large significance levels (say >0.05), and it may be that there are underlying mechanisms at play which could be better described otherwise.
This analysis (or something similar, perhaps more strongly connected to CO2 emissions directly) would seem to lend itself to time-series analysis, and developing an understanding of modeled factors contributing to fossil fuel consumption (and/or emissions) over time.
All apologies for the following models not being to exact equal scale between the right and left plots in each set - the revised setups resulted in slightly different chart setups and I ran out of time to ‘calibrate’ the relative widths as was done in the presentation plot.
# Note: 'Presentation Version' in the body of the report contains the embolden() function used in the last three plots, and the ridgeOrder, fillCols, fullList, and boldList vectors used for the first several plots
#library(dplyr)
#library(ggplot2)
# LOG-MODIFIED PRESENTATION VERSION - LOG TRANSFORMATION OF EACH CONSUMPTION VALUE
# Need to add a small positive value to each consumption value to avoid log(0)'s
# Note: This results in relatively large negative values for '0' consumption entries
eConsPlot1a <- ggplot(eConsDat, aes(x = log(tnBTU +1e-7), y = eConSource, fill = eType)) +
geom_density_ridges(scale = 1, rel_min_height = 0.01) +
scale_x_continuous(breaks = seq(-20,15,5)) +
scale_y_discrete(limits = ridgeOrder) +
scale_fill_manual(values = fillCols, guide = FALSE) +
labs(title = "Density Ridge Plot of Energy Consumption \n by Category - Per-State",
x = "log(tnBTUs)", y = "Energy Source", subtitle = "log-transformed",
caption = "Values at -16.1 reflect '0' consumption") +
theme_ds2() +
theme(panel.grid.major.y = element_line(color = "lightgrey", linetype = 1)) +
theme(axis.text.y=element_text(face=embolden(fullList, boldList)))
eConsPlot2a <- ggplot(eConsDat, aes(y = log(tnBTU +1e-7), x = eConSource, col = eType)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(-20,15,5)) +
scale_x_discrete(limits = ridgeOrder) +
scale_color_manual(values = fillCols, guide = FALSE) +
labs(title = "Boxplots of Energy Consumption \n by Category - Per-State",
y = "log(tnBTUs)", x = "Energy Source", subtitle = "log-transformed",
caption = "Values at -16.1 reflect '0' consumption") +
theme_ds2() +
theme(axis.title.y = element_blank(), axis.text.y = element_blank()) +
coord_flip()
#library(cowplot)
plot_grid(eConsPlot1a, eConsPlot2a, ncol = 2, align = "h", rel_widths = c(1.35,0.8))# PER CAPITA-MODIFIED PRESENTATION VERSION - EACH CONSUMPTION VALUE PER CAPITA BY STATE
# Note: statePopK is in 1,000s - tnBTU / statPopK would result in bnBTUs per person
# The below is rescaled to mnBTUs per person
eConsPlot1b <- ggplot(eConsDat, aes(x = tnBTU*1000/statePopK, y = eConSource, fill = eType)) +
geom_density_ridges(scale = 1, rel_min_height = 0.01) +
scale_x_continuous(breaks = seq(0,800,200)) +
scale_y_discrete(limits = ridgeOrder) +
scale_fill_manual(values = fillCols, guide = FALSE) +
labs(title = "Density Ridge Plot of Energy Consumption \n by Category - Per-State",
x = "mnBTUs/person", y = "Energy Source", subtitle = "Per Capita",
caption = "Note: rescaled") +
theme_ds2() +
theme(panel.grid.major.y = element_line(color = "lightgrey", linetype = 1)) +
theme(axis.text.y=element_text(face=embolden(fullList, boldList)))
eConsPlot2b <- ggplot(eConsDat, aes(y = tnBTU*1000/statePopK, x = eConSource, col = eType)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(0,800,200)) +
scale_x_discrete(limits = ridgeOrder) +
scale_color_manual(values = fillCols, guide = FALSE) +
labs(title = "Boxplots of Energy Consumption \n by Category - Per-State",
y = "mnBTUs/person", x = "Energy Source", subtitle = "Per Capita",
caption = "Note: rescaled") +
theme_ds2() +
theme(axis.title.y = element_blank(), axis.text.y = element_blank()) +
coord_flip()
#library(cowplot)
plot_grid(eConsPlot1b, eConsPlot2b, ncol = 2, align = "h", rel_widths = c(1.35,0.8))# MODIFIED PRESENTATION VERSION - Total Non-Renewables / Total Fossil Fuel / Total 'Other' / Total Renewables
# New lookup table to map factor levels of eConSource
lut2 <- c("rETotCons" = "Total Renewables", "othETotCons" = "Total 'Other'", "ffETotCons" = "Total Fossil Fuels", "nonRETotCons" = "Total Non-Renewables")
eConsDat2 <- finalDat %>%
select(state, rETotCons,
biomassECons, nclrECons, ffETotCons) %>%
mutate(othETotCons = biomassECons + nclrECons,
nonRETotCons = ffETotCons + othETotCons) %>%
select(state, rETotCons, othETotCons, ffETotCons, nonRETotCons) %>%
gather(key = "eConSource", value = tnBTU, -state) %>%
mutate(eConSource = factor(lut2[eConSource],
levels = c("Total Renewables", "Total 'Other'", "Total Fossil Fuels", "Total Non-Renewables")),
eType = factor(if_else(eConSource == "Total Renewables", "renewable",
if_else(eConSource == "Total Fossil Fuels", "fossil fuel",
"other")),
levels = c("other", "fossil fuel", "renewable")))
# Adding in state population in thousands - needs to be a separate mutate() step after eConsDat2 dataset is created
eConsDat2 <- eConsDat2 %>%
mutate(statePopK = rep(finalDat$statePopK, nrow(eConsDat2) / n_distinct(state)))
# Vector to set desired order of density ridges
ridgeOrder2 <- c("Total Renewables", "Total 'Other'", "Total Fossil Fuels", "Total Non-Renewables")
# same fillCols as prior eConsDat setup
# Create boldList2 and fullList2 for the second plot set
boldList2 <- ("Total Non-Renewables")
fullList2 <- factor(ridgeOrder2, levels = ridgeOrder2)
eConsPlot1c <- ggplot(eConsDat2, aes(x = tnBTU, y = eConSource, fill = eType)) +
geom_density_ridges(scale = 1, rel_min_height = 0.01) +
scale_x_continuous(breaks = seq(0,12000,3000)) +
scale_y_discrete(limits = ridgeOrder2) +
scale_fill_manual(values = fillCols, guide = FALSE) +
labs(title = "Density Ridge Plot of Energy Consumption \n by Category Totals - Per-State",
x = "tnBTUs", y = "Energy Category") +
theme_ds2() +
theme(panel.grid.major.y = element_line(color = "lightgrey", linetype = 1)) +
theme(axis.text.y=element_text(face=embolden(fullList2, boldList2)),
plot.title = element_text(hjust = 0.5, face = "plain", size = 11))
eConsPlot2c <- ggplot(eConsDat2, aes(y = tnBTU, x = eConSource, col = eType)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(0,12000,3000)) +
scale_x_discrete(limits = ridgeOrder2) +
scale_color_manual(values = fillCols, guide = FALSE) +
labs(title = "Boxplots of Energy Consumption \n by Category Totals - Per-State",
y = "tnBTUs", x = "Energy Category") +
theme_ds2() +
theme(axis.title.y = element_blank(), axis.text.y = element_blank(),
plot.title = element_text(hjust = 0.5, face = "plain", size = 11)) +
coord_flip()
#library(cowplot)
plot_grid(eConsPlot1c, eConsPlot2c, ncol = 2, align = "h", rel_widths = c(1.55,0.8))# MODIFIED VERSION LOG TRANSFORMED
eConsPlot1d <- ggplot(eConsDat2, aes(x = log(tnBTU), y = eConSource, fill = eType)) +
geom_density_ridges(scale = 1, rel_min_height = 0.01) +
scale_x_continuous(breaks = seq(0,10,2)) +
scale_y_discrete(limits = ridgeOrder2) +
scale_fill_manual(values = fillCols, guide = FALSE) +
labs(title = "Density Ridge Plot of Energy Consumption \n by Category Totals - Per-State",
x = "log(tnBTUs)", y = "Energy Category", subtitle = "log-transformed") +
theme_ds2() +
theme(panel.grid.major.y = element_line(color = "lightgrey", linetype = 1)) +
theme(axis.text.y=element_text(face=embolden(fullList2, boldList2)),
plot.title = element_text(hjust = 0.5, face = "plain", size = 11))
eConsPlot2d <- ggplot(eConsDat2, aes(y = log(tnBTU), x = eConSource, col = eType)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(0,10,2)) +
scale_x_discrete(limits = ridgeOrder2) +
scale_color_manual(values = fillCols, guide = FALSE) +
labs(title = "Boxplots of Energy Consumption \n by Category Totals - Per-State",
y = "log(tnBTUs)", x = "Energy Category", subtitle = "log-transformed") +
theme_ds2() +
theme(axis.title.y = element_blank(), axis.text.y = element_blank(),
plot.title = element_text(hjust = 0.5, face = "plain", size = 11)) +
coord_flip()
#library(cowplot)
plot_grid(eConsPlot1d, eConsPlot2d, ncol = 2, align = "h", rel_widths = c(1.55,0.8))# MODIFIED VERSION PER-CAPITA
eConsPlot1e <- ggplot(eConsDat2, aes(x = tnBTU*1000/statePopK, y = eConSource, fill = eType)) +
geom_density_ridges(scale = 1, rel_min_height = 0.01) +
scale_x_continuous(breaks = seq(0,1200,300)) +
scale_y_discrete(limits = ridgeOrder2) +
scale_fill_manual(values = fillCols, guide = FALSE) +
labs(title = "Density Ridge Plot of Energy Consumption \n by Category Totals - Per-State",
x = "mnBTUs/person", y = "Energy Category", subtitle = "Per Capita",
caption = "Note: rescaled") +
theme_ds2() +
theme(panel.grid.major.y = element_line(color = "lightgrey", linetype = 1)) +
theme(axis.text.y=element_text(face=embolden(fullList2, boldList2)),
plot.title = element_text(hjust = 0.5, face = "plain", size = 11))
eConsPlot2e <- ggplot(eConsDat2, aes(y = tnBTU*1000/statePopK, x = eConSource, col = eType)) +
geom_boxplot() +
scale_y_continuous(breaks = seq(0,1200,300)) +
scale_x_discrete(limits = ridgeOrder2) +
scale_color_manual(values = fillCols, guide = FALSE) +
labs(title = "Boxplots of Energy Consumption \n by Category Totals - Per-State",
y = "mnBTUs/person", x = "Energy Category", subtitle = "Per Capita",
caption = "Note: rescaled") +
theme_ds2() +
theme(axis.title.y = element_blank(), axis.text.y = element_blank(),
plot.title = element_text(hjust = 0.5, face = "plain", size = 11)) +
coord_flip()
#library(cowplot)
plot_grid(eConsPlot1e, eConsPlot2e, ncol = 2, align = "h", rel_widths = c(1.55,0.8))# Quick Summary of All Variables
summary(finalDat)## state gthrmECons hydroECons solarECons
## AK : 1 Min. : 0.0210 Min. : 0.00 Min. : 0.003
## AL : 1 1st Qu.: 0.3468 1st Qu.: 7.30 1st Qu.: 0.128
## AR : 1 Median : 0.9625 Median : 14.74 Median : 1.012
## AZ : 1 Mean : 4.2363 Mean : 46.42 Mean : 8.523
## CA : 1 3rd Qu.: 2.1260 3rd Qu.: 31.62 3rd Qu.: 4.220
## CO : 1 Max. :112.8780 Max. :684.06 Max. :211.763
## (Other):44
## windECons rETotCons biomassECons nclrECons
## Min. : 0.0000 Min. : 0.477 Min. : 2.883 Min. : 0.0
## 1st Qu.: 0.1208 1st Qu.: 22.061 1st Qu.: 34.054 1st Qu.: 0.0
## Median : 10.4190 Median : 43.788 Median : 70.814 Median : 101.9
## Mean : 35.5461 Mean : 94.729 Mean : 90.291 Mean : 166.7
## 3rd Qu.: 40.8230 3rd Qu.: 98.477 3rd Qu.:133.198 3rd Qu.: 285.4
## Max. :417.8020 Max. :751.780 Max. :288.651 Max. :1017.4
##
## coalECons natGasECons petroECons ffETotCons
## Min. : 0.00 Min. : 0.196 Min. : 81.95 Min. : 94.19
## 1st Qu.: 27.63 1st Qu.: 199.786 1st Qu.: 236.50 1st Qu.: 616.76
## Median : 253.77 Median : 317.535 Median : 489.73 Median : 1086.48
## Mean : 310.98 Mean : 562.921 Mean : 713.95 Mean : 1587.85
## 3rd Qu.: 408.16 3rd Qu.: 708.940 3rd Qu.: 776.38 3rd Qu.: 1718.67
## Max. :1340.42 Max. :4263.857 Max. :6180.60 Max. :11784.88
##
## biomassEProd coalEProd natGasEProd petroEProd
## Min. : 2.113 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 23.166 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 70.479 Median : 0.91 Median : 3.594 Median : 5.609
## Mean : 91.734 Mean : 358.62 Mean : 627.345 Mean : 328.408
## 3rd Qu.:125.219 3rd Qu.: 351.00 3rd Qu.: 365.694 3rd Qu.: 146.671
## Max. :537.517 Max. :6538.24 Max. :9449.581 Max. :7223.915
##
## ffETotProd totECons totEProd eConsIndex
## Min. : 0.00 Min. : 127.1 Min. : 3.821 Min. :-95.89
## 1st Qu.: 0.00 1st Qu.: 807.0 1st Qu.: 260.680 1st Qu.:-86.97
## Median : 37.74 Median : 1483.1 Median : 661.080 Median :-76.32
## Mean : 1314.37 Mean : 1939.6 Mean : 1667.574 Mean :-74.03
## 3rd Qu.: 1381.82 3rd Qu.: 2325.6 3rd Qu.: 1611.727 3rd Qu.:-67.16
## Max. :17144.75 Max. :12847.8 Max. :18129.647 Max. :-18.28
##
## eConsPctRE eConsPctFF netEBalance statePopK
## Min. : 0.07972 Min. :54.19 Min. :-4471.70 Min. : 587
## 1st Qu.: 1.58595 1st Qu.:72.42 1st Qu.:-1338.43 1st Qu.: 1854
## Median : 3.23839 Median :79.79 Median : -600.83 Median : 4547
## Mean : 6.37768 Mean :80.41 Mean : -272.03 Mean : 6405
## 3rd Qu.: 8.22625 3rd Qu.:90.21 3rd Qu.: -82.17 3rd Qu.: 7074
## Max. :35.91322 Max. :96.89 Max. : 8242.02 Max. :38994
##
## gdpPerCapK gdpTotMn gdpPctBusProfSvc gdpPctGovt
## Min. :31.70 Min. : 30356 Min. :19.70 Min. : 9.13
## 1st Qu.:42.78 1st Qu.: 84563 1st Qu.:28.88 1st Qu.:11.12
## Median :47.24 Median : 210271 Median :32.90 Median :12.40
## Mean :48.46 Mean : 357750 Mean :33.78 Mean :13.32
## 3rd Qu.:53.83 3rd Qu.: 467207 3rd Qu.:39.21 3rd Qu.:14.49
## Max. :67.63 Max. :2505853 Max. :56.31 Max. :23.27
##
## gdpPctHosp gdpPctManuf gdpPctMining gdpPctUtil
## Min. : 2.043 Min. : 2.178 Min. : 0.05278 Min. :0.5719
## 1st Qu.: 2.508 1st Qu.: 8.029 1st Qu.: 0.17120 1st Qu.:1.4010
## Median : 2.774 Median :11.272 Median : 0.54746 Median :1.6422
## Mean : 3.265 Mean :12.060 Mean : 2.79237 Mean :1.7018
## 3rd Qu.: 3.342 3rd Qu.:16.151 3rd Qu.: 2.09536 3rd Qu.:1.8678
## Max. :14.890 Max. :29.408 Max. :21.52534 Max. :2.7887
##
## gdpPctOth pres2016PctDem house2014PctDem rps2015
## Min. :23.04 Min. :22.43 Min. : 0.00 No :13
## 1st Qu.:31.08 1st Qu.:36.02 1st Qu.: 14.88 Voluntary: 8
## Median :32.70 Median :46.17 Median : 29.17 Yes :29
## Mean :33.08 Mean :44.02 Mean : 38.13
## 3rd Qu.:34.57 3rd Qu.:51.51 3rd Qu.: 58.89
## Max. :42.30 Max. :62.22 Max. :100.00
##
## persIncPerCapK eduPctLTHS eduPctHS eduPctSmColAssoc
## Min. :21.06 Min. : 7.20 Min. :20.70 Min. :23.20
## 1st Qu.:25.44 1st Qu.: 9.15 1st Qu.:26.70 1st Qu.:27.75
## Median :27.67 Median :11.15 Median :28.50 Median :29.95
## Mean :28.49 Mean :11.77 Mean :28.98 Mean :30.23
## 3rd Qu.:30.98 3rd Qu.:14.40 3rd Qu.:31.30 3rd Qu.:32.60
## Max. :38.80 Max. :18.20 Max. :40.70 Max. :37.20
##
## eduPctBac eduPctGradProf annMeanInsol annMeanWs50m
## Min. :11.70 Min. : 7.400 Min. :2.512 Min. :3.877
## 1st Qu.:16.50 1st Qu.: 8.775 1st Qu.:3.763 1st Qu.:4.598
## Median :18.30 Median :10.200 Median :3.993 Median :4.961
## Mean :18.28 Mean :10.736 Mean :4.143 Mean :5.347
## 3rd Qu.:20.07 3rd Qu.:11.900 3rd Qu.:4.407 3rd Qu.:6.119
## Max. :24.10 Max. :17.700 Max. :5.920 Max. :7.161
##
## stateName eduPctHSOrLess eduPctBacGradProf
## Length:50 Min. :31.10 Min. :19.10
## Class :character 1st Qu.:36.83 1st Qu.:25.95
## Mode :character Median :40.55 Median :28.15
## Mean :40.76 Mean :29.01
## 3rd Qu.:43.25 3rd Qu.:31.77
## Max. :55.70 Max. :40.50
##
library(ggplot2)
library(fiftystater)
# Create some default plot settings to reduce lengthy code chunks
map_plot_defaults <- list(expand_limits(x = fifty_states$long, y = fifty_states$lat),
coord_map(),
scale_x_continuous(breaks = NULL),
scale_y_continuous(breaks = NULL),
scale_fill_gradient(low = "lightgrey", high = blues_set5[5]),
theme_ds2())
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = statePopK/1000), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "Pop. (mn)", low = "lightgrey", high = blues_set5[5]) +
labs(title = "Population in Millions", x = "", y = "")library(ggplot2)
library(fiftystater)
# Create some default plot settings to reduce lengthy code chunks
map_plot_defaults <- list(expand_limits(x = fifty_states$long, y = fifty_states$lat),
coord_map(),
scale_x_continuous(breaks = NULL),
scale_y_continuous(breaks = NULL),
scale_fill_gradient(low = "lightgrey", high = blues_set5[5]),
theme_ds2())
# State GDP in billions of USD
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gdpTotMn/1000), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "Bn USD", low = "lightgrey", high = blues_set5[5]) +
labs(title = "Total State GDP, in Billions of USD", x = "", y = "")# Scatterplot of state population in millions and GDP in billions
ggplot(finalDat, aes(x = statePopK/1000, y = gdpTotMn/1000)) +
geom_point(alpha = 0.4, color = "darkblue") +
geom_text(aes(label = state), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, size = 0.7) +
labs(title = "Plot of State GDP on State Population",
x = "State Population, in Millions",
y = "Total State GDP, in Billions of USD") +
theme_ds2()# Proportion of GDP from Business and Professional Services
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gdpPctBusProfSvc), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "Proportion of State GDP from \n Business and Professional Services",
x = "", y = "")# Proportion of GDP from Government
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gdpPctGovt), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "Proportion of State GDP from \n Government",
x = "", y = "")# Proportion of GDP from Hospitality
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gdpPctHosp), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "Proportion of State GDP from \n Hospitality",
x = "", y = "")# Proportion of GDP from Manufacturing
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gdpPctManuf), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "Proportion of State GDP from \n Manufacturing",
x = "", y = "")# Proportion of GDP from Mining
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gdpPctMining), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "Proportion of State GDP from \n Mining",
x = "", y = "")# Proportion of GDP from Utilities
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gdpPctUtil), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "Proportion of State GDP from \n Utilities",
x = "", y = "")# Proportion of GDP from Other Industries
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gdpPctOth), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "Proportion of State GDP from \n Other Industries",
x = "", y = "")# Plot US states by education levels and per capita GDP / income
library(ggplot2)
library(fiftystater)
# Create some default plot settings to reduce lengthy code chunks
map_plot_defaults <- list(expand_limits(x = fifty_states$long, y = fifty_states$lat),
coord_map(),
scale_x_continuous(breaks = NULL),
scale_y_continuous(breaks = NULL),
scale_fill_gradient(low = "lightgrey", high = blues_set5[5]),
theme_ds2())
# Less than High School Education
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eduPctLTHS), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "% of State Population Age 25+ \n with Less than High School Education",
x = "", y = "")# High School Education
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eduPctHS), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "% of State Population Age 25+ \n with High School Education",
x = "", y = "")# High School or Less Education
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eduPctHSOrLess), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "% of State Population Age 25+ \n with High School Education or Less",
x = "", y = "")# Associate's Degree or Some College Education
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eduPctSmColAssoc), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "% of State Population Age 25+ \n with an Associate's Degree \n or Some College",
x = "", y = "")# Bachelor's Degree Education
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eduPctBac), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "% of State Population Age 25+ \n with a Bachelor's Degree",
x = "", y = "")# Graduate or Professional Degree Education
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eduPctGradProf), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "% of State Population Age 25+ \n with a Graduate or Professional Degree",
x = "", y = "")# Bachelor's Degree, Graduate, or Professional Degree Education
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eduPctBacGradProf), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "% of State Population Age 25+ \n with at least a Bachelor's Degree",
x = "", y = "")# Per Capita GDP by State
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gdpPerCapK), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "Per Capita GDP by State, in 1,000 USD", x = "", y = "")# Per Capita Personal Income by State
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = persIncPerCapK), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
labs(title = "Per Capita Personal Income by State, in 1,000 USD", x = "", y = "")# Note: r = 0.815 correlation betw. EduPctGradProf and PersIncpercapk
# Comparing Per Capita GDP by State and Per Capita Personal Income by State
ggplot(finalDat, aes(x = gdpPerCapK, y = persIncPerCapK)) +
geom_point(alpha = 0.4, color = "darkblue") +
geom_text(aes(label = state), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, size = 0.7) +
labs(title = "Plot of Per Capita GDP by State \n on Per Capita Personal Income by State",
x = "Per Capita GDP (in 1,000s USD) by State",
y = "Per Capita Personal Income \n (in 1,000s USD) by State") +
theme_ds2()# Comparing % Democrat Representatives (2014 midterm elections) vs.
# % votes for Democrat
library(ggplot2)
ggplot(finalDat, aes(x = house2014PctDem, y = pres2016PctDem)) +
geom_point(alpha = 0.4, color = "darkblue") +
geom_text(aes(label = state), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, size = 0.7) +
labs(title = "Plot of % 2016 Presidential Vote Share for Democratic Candidate \n on % Democratic Rep Composition in 2014 Election",
x = "% of 2014 Representatives from Democractic Party",
y = "% of 2016 Presidential Votes \n for Democratic Party Candidate") +
theme_ds2()library(ggplot2)
library(fiftystater)
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eConsIndex), color = "darkgrey", map = fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map() +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
scale_fill_gradient(name = "% RE Cons - % FF Cons", low = color_set6[5], high = color_set6[4]) +
labs(title = "% Renewable Energy minus % Fossil Fuel \n Consumption in 2015",
x = "", y = "") +
theme_ds2()# Scatterplot using the same two variables
ggplot(data = finalDat, aes(x = eConsPctRE/100, y = eConsPctRE/100)) +
geom_point(alpha = 0.4, color = "darkblue") +
geom_text(aes(label = state), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, size = 0.7) +
labs(title = "Plot of % Fossil Fuel Consumption \n on % Renewable Energy Consumption",
x = "% Renewable Energy Consumption",
y = "% Fossil Fuel Consumption") +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
theme_ds2()library(ggplot2)
library(fiftystater)
# Create some default plot settings to reduce lengthy code chunks
map_plot_defaults <- list(expand_limits(x = fifty_states$long, y = fifty_states$lat),
coord_map(),
scale_x_continuous(breaks = NULL),
scale_y_continuous(breaks = NULL),
theme_ds2())
# Total Energy Consumption in Trillions of BTUs
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[6]) +
labs(title = "Total Energy Consumption", x = "", y = "",
caption = "BTU = British Thermal Unit; Energy needed to heat 0.45kg of water 1F")# Total Energy Production in Trillions of BTUs
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = totEProd), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[6]) +
labs(title = "Total Energy Production",
x = "", y = "")# Net Energy Balance in Trillions of BTUs
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = netEBalance), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = color_set6[1], high = color_set6[6]) +
labs(title = "Net Energy Balance",
x = "", y = "", subtitle = "Positive values indicate energy exports")# Ratio of Total Consumption to Total Production
# Created based on post-presentation feedback
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = (totECons / totEProd)), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "Consumption:Production Ratio", low = color_set6[1], high = color_set6[6]) +
labs(title = "Energy Balance - Ratio of Consumption to Production",
x = "", y = "", subtitle = ">1: Cons > Prod; <1: Cons < Prod",
caption = "Only DE stands out for high condumption relative to production")library(ggplot2)
library(fiftystater)
library(cowplot)
# Create some default plot settings to reduce lengthy code chunks
map_plot_defaults <- list(expand_limits(x = fifty_states$long, y = fifty_states$lat),
coord_map(),
scale_x_continuous(breaks = NULL),
scale_y_continuous(breaks = NULL),
theme_ds2(),
theme(legend.position = "bottom", legend.text = element_text(size = 11)))
# Renewable Energy Production/Consumption in Trillions of BTUs
rePlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = rETotCons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[4]) +
labs(title = "Renewable Energy Consumption \n Total",
x = "", y = "")
# Renewable Energy Production/Consumption as % of Total Energy Consumption
rePlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eConsPctRE/100), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[4],
labels = scales::percent) +
labs(title = "Renewable Energy Consumption \n % of Total Consumption",
x = "", y = "")
plot_grid(rePlot1, rePlot2, align = "h")# Geothermal Energy Production/Consumption (treated as equivalent) in Trillions of BTUs
gthrmPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gthrmECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[4]) +
labs(title = "Geothermal Energy Consumption \n Total",
x = "", y = "")
gthrmPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = gthrmECons/totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[4],
labels = scales::percent) +
labs(title = "Geothermal Energy Consumption \n % of Total Consumption",
x = "", y = "")
plot_grid(gthrmPlot1, gthrmPlot2, align = "h")# Hydropower Energy Production/Consumption (treated as equivalent) in Trillions of BTUs
hydroPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = hydroECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[4]) +
labs(title = "Hydropower Energy Consumption \n Total",
x = "", y = "")
hydroPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = hydroECons/totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[4],
labels = scales::percent) +
labs(title = "Hydropower Energy Consumption \n % of Total Consumption",
x = "", y = "")
plot_grid(hydroPlot1, hydroPlot2, align = "h")# Solar Energy Production/Consumption (treated as equivalent) in Trillions of BTUs
solarPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = solarECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[4]) +
labs(title = "Solar Energy Consumption \n Total",
x = "", y = "")
solarPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = solarECons/totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[4],
labels = scales::percent) +
labs(title = "Solar Energy Consumption \n % of Total Consumption",
x = "", y = "")
plot_grid(solarPlot1, solarPlot2, align = "h")# Wind Energy Production/Consumption (treated as equivalent) in Trillions of BTUs
windPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = windECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[4]) +
labs(title = "Wind Energy Consumption \n Total",
x = "", y = "")
windPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = windECons/totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[4],
labels = scales::percent) +
labs(title = "Wind Energy Consumption \n % of Total Consumption",
x = "", y = "") +
theme(legend.text = element_text(size = 8))
plot_grid(windPlot1, windPlot2, align = "h")# Nuclear Energy Production/Consumption (treated as equivalent) in Trillions of BTUs
nclrPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = nclrECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[6]) +
labs(title = "Nuclear Energy Consumption \n Total",
x = "", y = "")
nclrPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = nclrECons/totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[6],
labels = scales::percent) +
labs(title = "Nuclear Energy Consumption \n % of Total Consumption",
x = "", y = "")
plot_grid(nclrPlot1, nclrPlot2, align = "h")# Biomass Energy Consumption in Trillions of BTUs
bmsConsPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = biomassECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[6]) +
labs(title = "Biomass Energy Consumption \n Total",
x = "", y = "")
bmsConsPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = biomassECons/totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[6],
labels = scales::percent) +
labs(title = "Biomass Energy Consumption \n % of Total Consumption",
x = "", y = "")
plot_grid(bmsConsPlot1, bmsConsPlot2, align = "h")# Biomass Energy Production in Trillions of BTUs
bmsProdPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = biomassEProd), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[6]) +
labs(title = "Biomass Energy Production \n Total",
x = "", y = "")
bmsProdPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = biomassEProd/totEProd), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Prod", low = "lightgrey", high = color_set6[6],
labels = scales::percent) +
labs(title = "Biomass Energy Production \n % of Total Production",
x = "", y = "")
plot_grid(bmsProdPlot1, bmsProdPlot2, align = "h")rm(rePlot1, rePlot2, gthrmPlot1, gthrmPlot2, hydroPlot1, hydroPlot2, solarPlot1, solarPlot2, windPlot1, windPlot2, nclrPlot1, nclrPlot2, bmsConsPlot1, bmsConsPlot2, bmsProdPlot1, bmsProdPlot2)library(ggplot2)
library(fiftystater)
# Create some default plot settings to reduce lengthy code chunks
map_plot_defaults <- list(expand_limits(x = fifty_states$long, y = fifty_states$lat),
coord_map(),
scale_x_continuous(breaks = NULL),
scale_y_continuous(breaks = NULL),
theme_ds2(),
theme(legend.position = "bottom", legend.text = element_text(size = 11)))
# Fossil Fuel Consumption in Trillions of BTUs
ffConsPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = ffETotCons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[5]) +
labs(title = "Fossil Fuel Consumption \n Total",
x = "", y = "")
# Fossil Fuel Energy Consumption as % of Total Energy Consumption
ffConsPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = eConsPctFF/100), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[5],
labels = scales::percent) +
labs(title = "Fossil Fuel Consumption \n % of Total Consumption",
x = "", y = "") +
theme(legend.text = element_text(size = 10))
plot_grid(ffConsPlot1, ffConsPlot2, align = "h")# Fossil Fuel Production in Trillions of BTUs
ffProdPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = ffETotProd), color = "darkgrey",
map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[5]) +
labs(title = "Fossil Fuel Production \n Total",
x = "", y = "")
ffProdPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = ffETotProd/totEProd), color = "darkgrey",
map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Prod", low = "lightgrey", high = color_set6[5],
labels = scales::percent) +
labs(title = "Fossil Fuel Production \n % of Total Production",
x = "", y = "")
plot_grid(ffProdPlot1, ffProdPlot2, align = "h")# Coal Consumption in Trillions of BTUs
coalConsPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = coalECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[5]) +
labs(title = "Coal Consumption \n Total",
x = "", y = "")
coalConsPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = coalECons/totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[5],
labels = scales::percent)
labs(title = "Coal Consumption \n % of Total Consumption",
x = "", y = "")## $title
## [1] "Coal Consumption \n % of Total Consumption"
##
## $x
## [1] ""
##
## $y
## [1] ""
##
## attr(,"class")
## [1] "labels"
plot_grid(coalConsPlot1, coalConsPlot2, align = "h") # Coal Production in Trillions of BTUs
coalProdPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = coalEProd), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[5]) +
labs(title = "Coal Production \n Total",
x = "", y = "")
coalProdPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = coalEProd/totEProd), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Prod", low = "lightgrey", high = color_set6[5],
labels = scales::percent) +
labs(title = "Coal Production \n % of Total Production",
x = "", y = "")
plot_grid(coalProdPlot1, coalProdPlot2, align = "h")# Natural Gas Consumption in Trillions of BTUs
ngConsPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = natGasECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[5]) +
labs(title = "Natural Gas Consumption \n Total",
x = "", y = "")
ngConsPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = natGasECons/totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[5],
labels = scales::percent) +
labs(title = "Natural Gas Consumption \n % of Total Consumption",
x = "", y = "")
plot_grid(ngConsPlot1, ngConsPlot2, align = "h") # Natural Gas Production in Trillions of BTUs
ngProdPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = natGasEProd), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[5]) +
labs(title = "Natural Gas Production \n Total",
x = "", y = "")
ngProdPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = natGasEProd/totEProd), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Prod", low = "lightgrey", high = color_set6[5],
labels = scales::percent) +
labs(title = "Natural Gas Production \n % of Total Production",
x = "", y = "")
plot_grid(ngProdPlot1, ngProdPlot2, align = "h")# Petroleum Consumption in Trillions of BTUs
petroConsPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = petroECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[5]) +
labs(title = "Petroleum Consumption \n Total",
x = "", y = "")
petroConsPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = petroECons/totECons), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Cons", low = "lightgrey", high = color_set6[5],
labels = scales::percent) +
labs(title = "Petroleum Consumption \n % of Total Consumption",
x = "", y = "")
plot_grid(petroConsPlot1, petroConsPlot2, align = "h")# Crude Oil Production in Trillions of BTUs
oilProdPlot1 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = petroEProd), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "tnBTUs", low = "lightgrey", high = color_set6[5]) +
labs(title = "Crude Oil Production \n Total",
x = "", y = "")
oilProdPlot2 <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = petroEProd/totEProd), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "% Total E Prod", low = "lightgrey", high = color_set6[5],
labels = scales::percent) +
labs(title = "Crude Oil Production \n % of Total Production",
x = "", y = "")
plot_grid(oilProdPlot1, oilProdPlot2, align = "h")library(ggplot2)
library(fiftystater)
# Create some default plot settings to reduce lengthy code chunks
map_plot_defaults <- list(expand_limits(x = fifty_states$long, y = fifty_states$lat),
coord_map(),
scale_x_continuous(breaks = NULL),
scale_y_continuous(breaks = NULL),
theme_ds2())
# Mean Annual Solar Insolation
insolPlot <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = annMeanInsol), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = expression(paste("kW/m",""^2," per day")),
low = color_set6[1], high = color_set6[4]) +
labs(title = expression(paste("Mean Annual Solar Insolation (kW/m",""^2," per day)")),
x = "", y = "") +
theme(legend.position = "bottom")
# Mean Annual Wind Speed at 50m off the ground
windSpdPlot <- ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = annMeanWs50m), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_gradient(name = "mph",
low = color_set6[1], high = color_set6[4]) +
labs(title = expression(paste("Mean Annual Wind Speed at Height 50m (mph)")),
x = "", y = "") +
theme(legend.position = "bottom")
insolPlotwindSpdPlotlibrary(cowplot)
nasaPlots <- plot_grid(insolPlot, windSpdPlot, ncol = 2, align = "h")
nasaPlotsggsave(filename = "C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/Graphics/nasaPlots.png", plot = nasaPlots, width = 10, height = 6)
# Scatterplot of Mean Wind Speed at 50m off the ground on Mean Solar Insolation
# Using relative percentiles for each variable
# Generate percentile assignments to Solar Insolation and Wind Speed variables,
# then convert the percentile labels to numeric values tied to the correct state
library(dplyr)
library(stringr)
dat1 <- finalDat %>% arrange(annMeanInsol) %>%
mutate(insolPctl = names(quantile(annMeanInsol,
probs = seq(0, 1, (1/(nrow(finalDat)-1)))))) %>%
mutate(insolPctl = as.numeric(str_remove(insolPctl,"%"))) %>%
select(state, insolPctl)
dat2 <- finalDat %>% arrange(annMeanWs50m) %>%
mutate(ws50mPctl = names(quantile(annMeanWs50m,
probs = seq(0, 1, (1/(nrow(finalDat)-1)))))) %>%
mutate(ws50mPctl = as.numeric(str_remove(ws50mPctl,"%"))) %>%
select(state, ws50mPctl)
dat <- left_join(dat1, dat2, by = "state")
rm(dat1, dat2)
ggplot(dat, aes(x = insolPctl, y = ws50mPctl)) +
geom_point(alpha = 0.4, color = "darkblue") +
geom_text(aes(label = state), alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, size = 0.7) +
geom_vline(xintercept = 50, col = "lightgrey") +
geom_hline(yintercept = 50, col = "lightgrey") +
labs(title = "Plot of Pctls, Annual Mean Wind Speed at Height 50m \n on Annual Mean Solar Insolation",
x = expression(paste("Pctl Mean Annual Solar Insolation (kW/m",""^2," per day)")),
y = "Pctl Mean Annual Wind Speed \n at Height 50m (mph)") +
theme_ds2()# Map plot of Renewable Energy Portfolio Standards (RPS)
fill_colors <- c("No" = blues_set5[5], "Voluntary" = blues_set5[3], "Yes" = blues_set5[2])
ggplot(data = finalDat, aes(map_id = stateName)) +
geom_map(aes(fill = rps2015), color = "darkgrey", map = fifty_states) +
map_plot_defaults +
scale_fill_manual(values = fill_colors) +
labs(title = expression(paste("2015 Renewable Portfolio Standards (RPS) by State")),
x = "", y = "")rm(dat, fill_colors, nasaPlots)# Annual behavior of monthly means for solar insolation and wind speed at 50m
usInsolWind <- read.csv("C:/Users/Duane/Documents/Academic/Villanova/2. Spring 18/MAT 8406_Regression Methods/Project/us_1to8926.csv")
# Grouping by State and Variable and Summarizing
library(dplyr)
library(tidyr)
# Note: Need to filter out one entry for Ontario (very close to int'l boundary)
# And DC (not in the scope of the analysis)
usInsol <- usInsolWind %>%
select(-X, -LAT, -LON, -COUNTRY, -LOC, -PAIR) %>%
filter(PARAMETER == "ALLSKY_SFC_SW_DWN" & STATE != "ON" & STATE != "DC") %>%
rename(state = STATE) %>%
group_by(state) %>%
summarize(Jan = mean(JAN), Feb = mean(FEB), Mar = mean(MAR),
Apr = mean(APR), May = mean(MAY), Jun = mean(JUN),
Jul = mean(JUL), Aug = mean(AUG), Sep = mean(SEP),
Oct = mean(OCT), Nov = mean(NOV), Dec = mean(DEC)) %>%
gather(key = month, value = meanInsol, -state) %>%
mutate(month = match(month, month.abb))
# Note: some states seem to not align with the rest for mean_insol trending;
# The below code extracts these (HI is the "_max" and AK is the "_min" for each pair)
usInsolWinterMin <- usInsol %>%
filter(month == 1) %>% arrange(meanInsol) %>% head(n=1)
usInsolWinterMax <- usInsol %>%
filter(month == 1) %>% arrange(meanInsol) %>% tail(n=1)
usInsolFallMin <- usInsol %>%
filter(month == 10) %>% arrange(meanInsol) %>% head(n=1)
usInsolFallMax <- usInsol %>%
filter(month == 10) %>% arrange(meanInsol) %>% tail(n=1)
usWs50m <- usInsolWind %>%
select(-X, -LAT, -LON, -COUNTRY, -LOC, -PAIR) %>%
filter(PARAMETER == "WS50M" & STATE != "ON" & STATE != "DC") %>%
rename(state = STATE) %>%
group_by(state) %>%
summarize(Jan = mean(JAN), Feb = mean(FEB), Mar = mean(MAR),
Apr = mean(APR), May = mean(MAY), Jun = mean(JUN),
Jul = mean(JUL), Aug = mean(AUG), Sep = mean(SEP),
Oct = mean(OCT), Nov = mean(NOV), Dec = mean(DEC)) %>%
gather(key = month, value = meanWs50m, -state) %>%
mutate(month = match(month, month.abb))
ggplot(data = usInsol, aes(x = month, y = meanInsol)) +
geom_line(aes(col = state), alpha = 0.6) +
geom_text(data = usInsolWinterMax,
aes(x = month, y = meanInsol, label = state)) +
geom_text(data = usInsolWinterMin,
aes(x = month, y = meanInsol, label = state)) +
labs(title = "Trend in Mean Solar Insolation \n Across the Year Among States",
x = "Month",
y = expression(paste("Mean Solar Insolation (kW/m",""^2," per day)")),
caption = "HI and AK are the high and low outliers, respectively, across fall and winter") +
scale_color_discrete(guide = FALSE) +
scale_x_continuous(breaks = seq(1,12,1)) +
theme_ds2()ggplot(data = usWs50m, aes(x = month, y = meanWs50m)) +
geom_line(aes(col = state), alpha = 0.6) +
labs(title = "Trend in Mean Wind Speed at Height 50m \n Across the Year Among States",
x = "Month",
y = "Mean Wind Speed \n at Height 50m (mph)") +
scale_color_discrete(guide = FALSE) +
scale_x_continuous(breaks = seq(1,12,1)) +
theme_ds2()rm(usInsolWind, usInsol, usInsolFallMax, usInsolFallMin, usInsolWinterMax, usInsolWinterMin, usWs50m)library(dplyr)
library(corrplot)## corrplot 0.84 loaded
# Determine only numeric variables in dataset
varTypes <- sapply(finalDat, class)
numericVars <- names(which((varTypes == "numeric") | (varTypes == "integer")))
numericDat <- finalDat %>% select(numericVars)
# Create correlation plot among numeric variables
scaleCols <- colorRampPalette(c(color_set6[1], "#FFFFFF", color_set6[6]))
corrplot(cor(numericDat), method = "square", type = "upper", order = "AOE",
col = scaleCols(100), tl.col = "black", tl.cex = 0.8, cl.align.text = "l")rm(varTypes, numericVars, numericDat, scaleCols)